
Dell EMC Streaming Data Platform Developer's Guide

Version 1.2

March 2021

Notes, cautions, and warnings

NOTE: A NOTE indicates important information that helps you make better use of your product.

CAUTION: A CAUTION indicates either potential damage to hardware or loss of data and tells you how to avoid

the problem.

WARNING: A WARNING indicates a potential for property damage, personal injury, or death.

© 2020 - 2021 Dell Inc. or its subsidiaries. All rights reserved. Dell, EMC, and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.

Contents

Figures
Tables

Chapter 1: Introduction
  Product summary
  Product components
    About Pravega
    About analytic engines and Pravega connectors
    Management plane
  Deployment options
  Component deployment matrix
  Architecture and supporting infrastructure
  Product highlights
  More features
  Basic terminology
  Interfaces
    User Interface (UI)
    Grafana dashboards
    Apache Flink Web UI
    Apache Spark Web UI
    APIs
  What you get with SDP
  Use case examples
  Documentation resources

Chapter 2: Using the Streaming Data Platform (SDP)
  About installation and administration
  User access roles
  About project isolation
  Log on as a new user
  Change your password or other profile attributes
  Overview of the project-member user interface
  Analytics projects and Pravega scopes views
    View projects and the project dashboard
    Pravega scopes view
    Pravega streams view
  Create analytics projects and clusters
    About analytics projects
    About Flink clusters
    Components of a Flink setup
    About Flink jobs
    Create Flink clusters
    Change Flink cluster attributes
    Delete a Flink cluster
  About stream processing analytics engines and Pravega connectors
    About the Apache Flink integration
    About the Apache Spark integration

Chapter 3: Working with Apache Flink Applications
  About hosting artifacts in Maven repositories
    About artifacts for Flink projects
    Upload Flink application artifacts
    Deploy new Flink applications
    View and delete artifacts
    Connect to the Maven repository using the mvn CLI
    Using Gradle
  About Apache Flink applications
  Application lifecycle and scheduling
    Apache Flink life cycle
  View status and delete applications
    View and edit Flink applications
    View application properties
    View application savepoints
    View application events
    View application deployment logs
    View deployed applications in the Apache Flink UI

Chapter 4: Working with Apache Spark Applications
  About Apache Spark applications
  Apache Spark terminology
  Apache Spark life cycle
  Understanding Spark clusters
  Create a new Spark application
  Create a new artifact
  View status of Spark applications
  View the properties of Spark applications
  View the history of a Spark application
  View the events for Spark applications
  View the logs for Spark applications
  Deploying Spark applications
    Uploading common artifacts to your analytics project
    Deploying Python applications using the SDP UI
    Deploying Python applications using Kubectl
    Deploying Java or Scala applications using the SDP UI
  Troubleshooting Spark applications

Chapter 5: Working with Analytic Metrics
  About analytic metrics for projects
  Enable analytic metrics for a project
  View analytic metrics
  Add custom metrics to an application
  Analytic metrics predefined dashboard examples

Chapter 6: Working with Pravega Streams
  About Pravega streams
  About stateful stream processing
  About event-driven applications
  About exactly once semantics
    Pravega exactly once semantics
    Apache Flink exactly once semantics
  Understanding checkpoints and savepoints
  About Pravega scopes
    Accessing Pravega scopes and streams
    About cross project scope sharing
  About Pravega streams
  Create data streams
    Create a stream from an existing Pravega scope
    Stream access rules
  Stream configuration attributes
  List streams (scope view)
  View streams
  Edit streams
  Delete streams
  Start and stop stream ingestion
  Monitor stream ingestion
  About Grafana dashboards to monitor Pravega

Chapter 7: Working with Pravega Schema Registry
  About Pravega schema registry
  Create schema group
  List and delete schemas
  Add schema to streams
  Edit schema group and add codecs

Chapter 8: Working with Pravega Search (PSearch)
  Overview
    Two query types
    Additional Pravega Search functionality
    Limitations
  Using Pravega Search
    Create Pravega Search cluster
    Make a searchable stream
    Continuous queries
    Search queries
  Health and maintenance
    Monitor Pravega Search cluster metrics
    Delete index documents
    Scale the Pravega Search cluster
  Pravega Search REST API
    REST head
    Authentication
    Resources
    Methods documentation

Chapter 9: Troubleshooting Application Deployments
  Apache Flink application failures
    Review the full deployment log
  Application logging
    Review application logs
  Application issues related to TLS
  Apache Spark Troubleshooting

Chapter 10: SDP Code Hub Applications and Sample Code
  Develop an example video processing application

Chapter 11: References
  Keycloak and Kubernetes user roles
  Additional resources
  Where to go for support
  Provide feedback about this document

Figures

1 SDP main components
2 SDP Core architecture
3 SDP Edge architecture
4 Initial administrator UI after login
5 Apache Flink Web UI
6 Apache Spark Web UI
7 Project Dashboard
8 Upload artifact
9 Apache Flink streaming data application lifecycle
10 Create new Spark app
11 Spark app form
12 Create Maven artifact
13 Maven artifacts
14 Create application file artifact
15 Application file artifacts
16 Spark App status
17 Spark application properties
18 Spark application history
19 Spark application events
20 Spark application logs
21 Spark Jobs
22 Flink connector reader app example code: Ingestion and reader apps shared between different projects
23 Example in SDP: Ingestion and reader apps shared between different projects
24 Location of Search Cluster tab in the UI
25 Sample application for video processing with Pravega

Tables

1 Component deployment matrix
2 Interfaces in SDP
3 SDP documentation set
4 Pravega Search procedures summary
5 Pravega Search cluster services

Introduction

This Developer's Guide is intended for application developers and data analysts with experience developing streaming analytics applications. It provides guidance on how to use the Dell EMC Streaming Data Platform and Pravega streaming storage.

Topics:

- Product summary
- Product components
- Deployment options
- Component deployment matrix
- Architecture and supporting infrastructure
- Product highlights
- More features
- Basic terminology
- Interfaces
- What you get with SDP
- Use case examples
- Documentation resources

Product summary

Dell EMC Streaming Data Platform (SDP) is an autoscaling software platform for ingesting, storing, and processing continuously streaming unbounded data. The platform can process both real-time and collected historical data in the same application.

SDP ingests and stores streaming data from sources such as Internet of Things (IoT) devices, web logs, industrial automation, financial data, live video, social media feeds, and applications. It also ingests and stores event-based streams. It can process multiple data streams from multiple sources while ensuring low latencies and high availability.

The platform manages stream ingestion and storage and hosts the analytic applications that process the streams. It dynamically distributes processing related to data throughput and analytical jobs over the available infrastructure. It also dynamically autoscales storage resources to handle requirements in real time as the streaming workload changes.

SDP supports the concept of projects and project isolation or multi-tenancy. Multiple teams of developers and analysts all use the same platform, but each team has its own working environment. The applications and streams that belong to a team are protected from write access by other users outside of the team. Cross-team stream data sharing is supported in read-only mode.

SDP integrates the following capabilities into one software platform:

Stream ingestion: The platform is an autoscaling ingestion engine. It ingests all types of streaming data, including unbounded byte streams and event-based data, in real time.

Stream storage: Elastic tiered storage provides instant access to real-time data, access to historical data, and near-infinite storage.

Stream processing: Real-time stream processing is possible with an embedded analytics engine. Your stream processing applications can perform functions such as:
- Process real-time and historical data.
- Process a combination of real-time and historical data in the same stream.
- Create and store new streams.
- Send notifications to enterprise alerting tools.
- Send output to third-party visualization tools.

Platform management: Integrated management provides data security, configuration, access control, resource management, an easy upgrade process, stream metrics collection, and health and monitoring features.

Run-time management: A web-based User Interface (UI) allows authorized users to configure stream properties, view stream metrics, run applications, view job status, and monitor system health.


Application development: The product distribution includes APIs. The web UI supports application deployment and artifact storage.

Product components

SDP is a software-only platform consisting of integrated components, supporting APIs, and Kubernetes Custom Resource Definitions (CRDs). This product runs in a Kubernetes environment.

Figure 1. SDP main components

Pravega: Pravega is the stream store in SDP. It handles ingestion and storage for continuously streaming unbounded byte streams. Pravega is an Open Source Software project, which is sponsored by Dell EMC.

Unified Analytics: SDP includes the following embedded analytic engines for processing your ingested stream data.

Apache Flink: Apache Flink is an embedded stream processing engine in SDP. Dell EMC distributes Docker images from the Apache Flink Open Source Software project.

SDP ships with images for Apache Flink. It also supports custom Flink images.

Apache Spark: Apache Spark is a unified analytics engine for large-scale data processing.

SDP ships with images for Apache Spark.

Pravega Search: Pravega Search provides query features on the data in Pravega streams. It supports filtering and tagging incoming data as it is ingested, as well as searching stored data.

For supported analytic engine image versions for this SDP release, see Component deployment matrix on page 13.

Management platform: The management platform is Dell EMC proprietary software. It integrates the other components and adds security, performance, configuration, and monitoring features.

User interface: The management plane provides a comprehensive web-based user interface for administrators and application developers.

Metrics stacks: SDP deploys InfluxDB databases and Grafana instances for metrics visualization. Separate stacks are deployed for Pravega and for each analytic project.

Pravega schema registry: The schema registry provides a serving and management layer for storing and retrieving schemas for Pravega streams.

APIs: Various APIs are included in the SDP distributions. APIs for Spark, Flink, Pravega, Pravega Search, and Pravega Schema Registry are bundled in this SDP release.


About Pravega

The Open Source Pravega project was created specifically to support streaming applications that handle large amounts of continuously arriving data.

In Pravega, the stream is a core primitive. Pravega ingests unbounded streaming data in real time and coordinates permanent storage.

Pravega user applications are known as Writers and Readers. Pravega Writers are applications using the Pravega API to ingest collected streaming data from several data sources into SDP. The platform ingests and stores the streams. Pravega Readers read data from the Pravega store.

Pravega streams are based on an append-only log data structure. By using append-only logs, Pravega rapidly ingests data into durable storage. Pravega handles all types of streams, including:

- Unbounded or bounded streams of data
- Streams of discrete events or a continuous stream of bytes
- Sensor data, server logs, video streams, or any other type of information

Pravega seamlessly coordinates a two-tiered storage system for each stream. Bookkeeper (called Tier 1) stores the recently ingested tail of a stream temporarily. Long-term storage (sometimes called Tier 2) occurs in a configured alternate location. You can configure streams with specific data retention periods.

An application, such as a Java program reading from an IoT sensor, writes data to the tail of the stream. Apache Flink applications can read from any point in the stream. Multiple applications can read and write the same stream in parallel. Some of the important design features in Pravega are:
- Elasticity, scalability, and support for large volumes of streaming data
- Preserved ordering and exactly-once semantics
- Data retention based on time or size
- Durability
- Transaction support

Applications can access data in real time or past time in a uniform fashion. The same paradigm (the same API call) accesses both real-time and historical data in Pravega. Applications can also wait for data that is associated with any arbitrary time in the future.

Specialized software connectors provide access to Pravega. For example, a Flink connector provides Pravega data to Flink jobs. Because Pravega is an Open Source project, it can potentially connect to any analytics engine with community-contributed connectors.

Pravega is unique in its ability to handle unbounded streaming bytes. It is a high-throughput, autoscaling real-time store that preserves key-based ordering of continuously streaming data and guarantees exactly-once semantics. It infinitely tiers ingested data into long-term storage.

For more information about Pravega, see http://www.pravega.io.
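The following minimal Java sketch illustrates the Writer and Reader pattern described above, using the open-source Pravega client API. The controller URI, scope, stream, and reader group names are placeholder assumptions, and SDP-specific security configuration (Keycloak credentials, TLS) is omitted.

```java
import java.net.URI;
import java.util.UUID;

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class PravegaWriterReaderSketch {
    public static void main(String[] args) {
        // Placeholder values: substitute your SDP Pravega controller endpoint,
        // project scope, and stream name.
        URI controller = URI.create("tcp://pravega-controller.example.com:9090");
        String scope = "my-project";
        String stream = "sensor-readings";
        String readerGroup = "example-group";

        ClientConfig clientConfig = ClientConfig.builder().controllerURI(controller).build();

        try (ReaderGroupManager rgm = ReaderGroupManager.withScope(scope, clientConfig);
             EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, clientConfig);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     stream, new UTF8StringSerializer(), EventWriterConfig.builder().build())) {

            // Writer: events that share a routing key keep their order.
            writer.writeEvent("sensor-42", "{\"temp\": 21.5}").join();

            // Reader: readers join a reader group, which coordinates distributed reads
            // and can start from the tail or from any historical position in the stream.
            rgm.createReaderGroup(readerGroup,
                    ReaderGroupConfig.builder().stream(Stream.of(scope, stream)).build());
            try (EventStreamReader<String> reader = factory.createReader(
                    UUID.randomUUID().toString(), readerGroup,
                    new UTF8StringSerializer(), ReaderConfig.builder().build())) {
                // readNextEvent blocks up to the timeout (ms); getEvent() may be null on timeout.
                System.out.println("Read: " + reader.readNextEvent(2000).getEvent());
            }
        }
    }
}
```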

About analytic engines and Pravega connectors

SDP includes analytic engines and connectors that enable access to Pravega streams.

Analytic engines run applications that analyze, consolidate, or otherwise process the ingested data.

Apache Flink: Apache Flink is a high throughput, stateful analytics engine with precise control of time and state. It is an emerging market leader for processing stream data. Apache Flink provides a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. It performs computations at in-memory speed and at any scale. Preserving the order of data during processing is guaranteed.

The Flink engine accommodates many types of stream processing models, including:

- Continuous data pipelines for real-time analysis of unbounded streams
- Batch processing
- Publisher/subscriber pipelines

The SDP distribution includes Apache Flink APIs that can process continuous streaming data, sets of historical batch data, or combinations of both.

For more information about Apache Flink, see https://flink.apache.org/.
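As an illustration of the Flink integration, the following sketch reads a Pravega stream into a Flink job with the open-source Pravega Flink connector. The controller URI, scope, and stream names are assumptions; in SDP, the Pravega configuration and credentials are normally provided to the project's Flink cluster by the platform.

```java
import java.net.URI;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import io.pravega.connectors.flink.FlinkPravegaReader;
import io.pravega.connectors.flink.PravegaConfig;

public class PravegaFlinkReadSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumed controller endpoint and scope; in an SDP Flink cluster these values
        // (and the Keycloak credentials) are typically injected by the platform.
        PravegaConfig pravegaConfig = PravegaConfig.fromDefaults()
                .withControllerURI(URI.create("tcp://pravega-controller.example.com:9090"))
                .withDefaultScope("my-project");

        // Source that reads the stream as a sequence of UTF-8 strings.
        FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
                .withPravegaConfig(pravegaConfig)
                .forStream("sensor-readings")   // assumed stream name
                .withDeserializationSchema(new SimpleStringSchema())
                .build();

        // A trivial pipeline: print every event; real jobs apply transformations here.
        env.addSource(source).print();

        env.execute("pravega-flink-read-sketch");
    }
}
```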


Apache Spark: Apache Spark provides a dataflow engine in which the user expresses the required flow using transformations and actions. Data is handled through a Resilient Distributed Dataset (RDD), an immutable, partitioned dataset that transformations and actions operate on. Application dataflow graphs are broken down into stages. Each stage creates a new RDD, but the RDD is not a materialized view on disk; it is an in-memory representation of the data held within the Spark cluster that later stages can process.

The SDP distribution includes Apache Spark APIs that can process streaming data, sets of historical batch data, or combinations of both. Spark has two processing modes: batch processing and streaming micro-batch.

For more information about Apache Spark, see https://spark.apache.org.
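The following sketch is a hypothetical example of reading a Pravega stream with Spark Structured Streaming in micro-batch mode. The format, option, and column names follow the open-source Pravega Spark connector and are shown as assumptions, as are the controller, scope, and stream values.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class PravegaSparkReadSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("pravega-spark-read-sketch")
                .getOrCreate();

        // Streaming micro-batch read from a Pravega stream using the Pravega
        // Spark connector; controller, scope, and stream are assumed values.
        Dataset<Row> events = spark.readStream()
                .format("pravega")
                .option("controller", "tcp://pravega-controller.example.com:9090")
                .option("scope", "my-project")
                .option("stream", "sensor-readings")
                .load();

        // The connector exposes the raw event payload as a binary column
        // (assumed here to be named "event"); cast it to a string for display.
        StreamingQuery query = events.selectExpr("CAST(event AS STRING) AS value")
                .writeStream()
                .outputMode("append")
                .format("console")
                .start();

        query.awaitTermination();
    }
}
```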

Other analytic engines and Pravega connectors: You may develop custom Pravega connectors to enable client applications to read from and write real-time data to Pravega streams. For more information, see the SDP Code Hub.

Management plane

The SDP management plane coordinates the interoperating functions of the other components.

The management plane deploys and manages components in the Kubernetes environment. It coordinates security, authentication, and authorization. It manages Pravega streams and the analytic applications in a single platform.

The web-based UI provides a common interface for all users. Developers can upload and update application images. All project members can manage streams and processing jobs. Administrators can manage resources and user access.

Some of the features of the management plane are:

- Integrated data security, including TLS encryption, multilevel authentication, and role-based access control (RBAC)
- Project-based isolation for team members and their respective streams and applications
- Possibility of sharing stream data in a read-only mode across streams
- Flink cluster and application management
- Spark application management
- Pravega streams management
- Stream data schema management and evolution history with a Schema Registry
- DevOps oriented platform for modern software development and delivery
- Integrated Kubernetes container environment
- Application monitoring and direct access to the Apache Flink or Apache Spark web-based UIs
- Direct access to predefined Grafana dashboards for Pravega
- Direct access to project-specific predefined Grafana dashboards showing operational metrics for Flink, Spark, and Pravega Search clusters

Deployment options

Streaming Data Platform supports options for deployment at the network edge and in the data center core.

SDP Edge: Deploying at the Edge, near gateways or sensors, has the advantage of local ingestion, transformation, and alerting. Data is processed, filtered, or enriched before transmission upstream to the Core.

SDP Edge is a small footprint deployment, requiring fewer minimum resources (CPU cores, RAM, and storage). It supports configurations of 1 or 3 nodes.
- Single-node deployment is an extremely low cost option for development, proof of concept, or for production use cases where high availability (HA) is not required. Single-node deployment operates without any external long-term storage, using only node disks for storage.
- 3-node deployments provide HA at the Edge for local data ingestion that cannot tolerate downtime. This deployment can use node disks or PowerScale for long-term storage.

SDP Core: SDP Core provides all the advantages of on-premise data collection, processing, and storage. SDP Core is intended for data center deployment with full size servers. It handles larger data ingestion needs and also accepts data collected by SDP Edge and streamed up to the Core.


HA is built into all deployments. Deployments start with a minimum of 3 nodes and can expand up to 12 nodes, with built-in scaling of added resources. Recommended servers have substantially more resources than SDP Edge. Long-term storage is Dell EMC PowerScale clusters or Dell EMC ECS appliances. Multi-tenant use cases are typical. Other intended use cases build models across larger data sets and ingest large amounts of data.

Component deployment matrix

This matrix shows the differences between SDP Edge and SDP Core deployments.

Table 1. Component deployment matrix

Component | SDP Core 1.2 | SDP Edge 1.2
Apache Flink | Ships with versions 1.11.2-2.12 | Ships with versions 1.11.2-2.12
Apache Spark | Ships with versions 2.4.7 and 3.0.1 | Ships with versions 2.4.7 and 3.0.1
Kubernetes platform | OpenShift 4.6 | Kubespray 2.14.2
Minimum number of nodes | 3 | 1
Maximum number of nodes | 12 | 3
Container runtime | CRI-O 1.19.0 | Docker 19.03
Operating system | RHEL 8.3, CoreOS 4.6 | Ubuntu 18.04
Long-term storage option: Dell EMC PowerScale | Gen5 or later hardware; OneFS 8.2.x or 9.x software with NFSv4.0 enabled | Gen5 or later hardware; OneFS 8.2.x or 9.x software with NFSv4.0 enabled; single-node deployment supports local long-term storage
Long-term storage option: Dell EMC ECS | ECS object storage appliance with ECS 3.5.1.4 and later, ECS 3.6.1.1 and later, and ECS 3.7.0.9 and later | Not supported
Secure Remote Services (SRS) Gateway | SRS 3.38 (minimum) | Optional. If used, SRS 3.38 (minimum)

Architecture and supporting infrastructure

The supporting infrastructure for SDP supplies storage and compute resources and the network infrastructure.

SDP is a software-only solution. The customer obtains the components for the supporting infrastructure independently.

For each SDP version, the reference architecture includes specific products that are tested and verified. The reference architecture is an end-to-end enterprise solution for stream processing use cases. Your Dell EMC sales representative can provide appropriate reference architecture solutions for your expected use cases.

A general description of the supporting infrastructure components follows.

Reference hardware: SDP runs on bare metal servers using custom operating system software provided in the SDP distribution. SDP Edge runs on Ubuntu. SDP Core runs on Red Hat Enterprise Linux CoreOS.

Network: A network is required for communication between the nodes in the SDP cluster and for the external clients to access cluster applications.

Local storage: Local storage is required for various system functions. The Dell support team helps size storage needs based on intended use cases.

Long-term storage: Long-term storage for stream data is required and is configured during installation. Long-term storage is any of the following:


- SDP Core production solutions require an elastic scale-out storage solution. You may use either of the following for long-term storage: a file system on a Dell EMC PowerScale cluster, or a bucket on the Dell EMC ECS appliance.
- SDP Edge production solutions use a file system on a Dell EMC PowerScale cluster. For testing, development, or use cases where only temporary storage is needed, long-term storage may be defined as a file system on a local mount point.

Kubernetes container environment included with SDP: SDP runs in a Kubernetes container environment. The container environment isolates projects, efficiently manages resources, and provides authentication and RBAC services. The required Kubernetes environments are provided with SDP distributions and are installed and configured as part of SDP installation:
- SDP Edge runs in Kubespray.
- SDP Core runs in Red Hat OpenShift.

The following figures show the supporting infrastructure in context with SDP.

Figure 2. SDP Core architecture

Figure 3. SDP Edge architecture


Product highlights

SDP includes the following major innovations and unique capabilities.

Enterprise-ready deployment: SDP is a cost effective, enterprise-ready product. This software platform, running on a recommended reference architecture, is a total solution for processing and storing streaming data. With SDP, an enterprise can avoid the complexities of researching, testing, and creating an appropriate infrastructure for processing and storing streaming data. The reference architecture consists of both hardware and software. The resulting infrastructure is scalable, secure, manageable, and verified. Dell EMC defines the infrastructure and provides guidance in setting it up. In this way, SDP dramatically reduces time to value for an enterprise.

SDP provides integrated support for a robust and secure total solution, including fault tolerance, easy scalability, and replication for data availability.

With SDP, Dell EMC provides the following deployment support:

- Recommendations for the underlying hardware infrastructure
- Sizing guidance for compute and storage to handle your intended use cases
- End-to-end guidance for setting up the reference infrastructure, including switching and network configuration (trunks, VLANs, management and data IP routes, and load balancers)
- Comprehensive image distribution, consisting of customized images for the operating system, supporting software, SDP software, and API distributions for developers
- Integrated installation and configuration for underlying software components (Docker, Helm, Kubernetes) to ensure alignment with SDP requirements

The result is an ecosystem ready to ingest and store streams, and ready for your developers to code and upload applications that process those streams.

Unbounded byte stream ingestion, storage, and analytics: Pravega was designed from the outset to handle unbounded byte stream data.

In Pravega, the unbounded byte stream is a primitive structure. Pravega stores each stream (any type of incoming data) as a single persistent stream, from ingestion to long-term storage, like this:

- Recent tail: The real-time tail of a stream exists on Tier 1 storage.
- Long-term: The entire stream is stored on long-term storage (also called Tier 2 storage in Pravega).

Applications use the same API call to access real-time data (the recent tail on Tier 1 storage) and all historical data on long-term storage.

In Apache Flink or Spark applications, the basic building blocks are streams and transformations. Conceptually, a stream is a potentially never-ending flow of data records. A transformation is an operation that takes one or more streams as input and produces one or more output streams. In both applications, non-streaming data is treated internally as a stream.

By integrating these products, SDP creates a solution that is optimized for processing unbounded streaming bytes. The solution is similarly optimized for bounded streams and more traditional static data.

High throughput stream ingestion: Pravega enables the ingestion capacity of a stream to grow and shrink according to workload. During ingestion, Pravega splits a stream into partitions to handle heavy traffic periods, and then merges partitions when traffic decreases. Splitting and merging occur automatically and continuously as needed. Throughout, Pravega preserves the order of data.
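As a sketch of how this elasticity is expressed with the open-source Pravega client API, a stream can be created with an auto-scaling policy. The controller endpoint, scope, stream name, and event-rate threshold below are assumptions.

```java
import java.net.URI;

import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

public class AutoScalingStreamSketch {
    public static void main(String[] args) {
        // Assumed controller endpoint, scope, and stream name.
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller.example.com:9090"))
                .build();

        try (StreamManager streamManager = StreamManager.create(clientConfig)) {
            // Target roughly 1000 events/second per segment, split segments by a
            // factor of 2 under sustained heavy traffic, and keep at least 2 segments.
            StreamConfiguration config = StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.byEventRate(1000, 2, 2))
                    .build();
            streamManager.createStream("my-project", "sensor-readings", config);
        }
    }
}
```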

Stream filtering on ingestion: PSearch continuous queries process data as it is ingested, providing a way to filter out unwanted data before it is stored, or to enrich the data with tagging before it is stored.

Stream search: PSearch queries can search an entire stored stream of structured or unstructured data.

Exactly-once semantics: Pravega is designed with exactly-once semantics as a goal. Exactly-once semantics means that, in a given stream processing application, no event is skipped or duplicated during the computations.

Key-based guaranteed order: Pravega guarantees key-based ordering. Information in a stream is keyed in a general way (for example, by sensor or other application-provided key). SDP guarantees that values for the same key are stored and processed in order. The platform, however, is free to scale the storage and processing across keys without concern for ordering.

The ordering guarantee supports use cases that require order for accurate results, such as in financial transactions.


Massive data volume: Pravega accommodates massive data ingestion. In the reference architecture, Dell EMC hardware solutions support the data processing and data storage components of the platform. All the processing and storage reference hardware is easily scaled out by adding additional nodes.

Batch and publish/subscribe models supported: Pravega, Apache Spark, and Apache Flink support the more traditional batch and publish/subscribe pipeline models. Processing for these models includes all the advantages and guarantees that are described for the continuous stream models.

Pravega ingests and stores any type of stream, including:

- Unbounded byte streams, such as data streamed from IoT devices
- Bounded streams, such as movies and videos
- Unbounded append-type log files
- Event-based input, streaming or batched

In Apache Flink and Apache Spark, all input is a stream. Both process table-based input and batch input as a type of stream.

ACID-compliant transaction support: The Pravega Writer API supports Pravega transactions. The Writer can collect events, persist them, and decide later whether to commit them as a unit to a stream. When the transaction is committed, all data that was written to the transaction is atomically appended to the stream.

The Writer might be an Apache Flink or other application. As an example, an application might continuously process data and produce results, using a Pravega transaction to durably accumulate the results. At the end of a time window, the application might commit the transaction into the stream, making the results of the processing available for downstream processing. If an error occurs, the application cancels the transaction and the accumulated processing results disappear.
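A minimal sketch of this transactional pattern with the open-source Pravega client API follows; the controller endpoint, scope, stream name, and routing key are assumptions, and SDP security configuration is omitted.

```java
import java.net.URI;

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.Transaction;
import io.pravega.client.stream.TransactionalEventStreamWriter;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class TransactionalWriteSketch {
    public static void main(String[] args) throws Exception {
        // Assumed controller endpoint, scope, and stream name.
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller.example.com:9090"))
                .build();

        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("my-project", clientConfig);
             TransactionalEventStreamWriter<String> writer =
                     factory.createTransactionalEventWriter("results",
                             new UTF8StringSerializer(), EventWriterConfig.builder().build())) {

            Transaction<String> txn = writer.beginTxn();
            try {
                // Durably accumulate results; nothing is visible to readers yet.
                txn.writeEvent("window-1", "partial-result-a");
                txn.writeEvent("window-1", "partial-result-b");
                // Atomically append all accumulated events to the stream.
                txn.commit();
            } catch (Exception e) {
                // On failure, discard the accumulated events.
                txn.abort();
                throw e;
            }
        }
    }
}
```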

Developers can combine transactions and other features of Pravega to create a chain of Flink jobs. The Pravega-based sink for one job is the source for a downstream Flink job. In this way, an entire pipeline of Flink jobs can have end-to-end exactly once, guaranteed ordering of data processing.

In addition, applications can coordinate transactions across multiple streams. A Flink job can use two or more sinks to provide source input to downstream Flink jobs.

Pravega achieves ACID compliance as follows:

Atomicity and Consistency are achieved in the basic implementation. A transaction is a set of events that is collectively either added into a stream (committed) or discarded (aborted) as a batch.

Isolation is achieved because the transactional events are never visible to any readers until the transaction is committed into a stream.

Durability is achieved when an event is written into the transaction and acknowledged back to the writer. Transactions are implemented in the same way as stream segments. Data that is written to a transaction is as durable as data written directly to a stream.

Security: Access to SDP and the data it processes is strictly controlled and integrated throughout all components.

- Authentication is provided through both Keycloak and LDAP.
- Kubernetes and Keycloak role-based access control (RBAC) protect resources throughout the platform.
- TLS controls external access.
- Within the platform, the concept of a project defines and isolates resources for a specific analytic purpose. Project membership controls access to those resources.

For information about these and other security features, see the Dell EMC Streaming Data Platform Security Configuration Guide at https://dl.dell.com/content/docu103273.

More features

Here are additional important capabilities in SDP.

Fault tolerance: The platform is fault tolerant in the following ways:
- All components use persistent volumes to store data.
- Kubernetes abstractions organize containers in a fault-tolerant way. Failed pods restart automatically, and deleted pods are re-created automatically.


- Certain key components, such as Keycloak, are deployed in "HA" mode by default. In the Keycloak case, three Keycloak pods are deployed, clustered together, to provide near-uninterrupted access even if a pod goes down.

Data retention and data purge: Pravega includes the following ways to purge data, per stream:
- A manual trigger in an API call specifies a point in a stream beyond which data is purged.
- An automatic purge may be based on the size of the stream.
- An automatic purge may be based on time.
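For illustration, the following sketch shows how these purge options map to the open-source Pravega client API, using assumed controller, scope, and stream names. The manual truncation call is shown commented out because it requires a StreamCut obtained elsewhere (for example, from a reader group checkpoint).

```java
import java.net.URI;
import java.time.Duration;

import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.RetentionPolicy;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

public class RetentionSketch {
    public static void main(String[] args) {
        // Assumed controller endpoint, scope, and stream name.
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller.example.com:9090"))
                .build();

        try (StreamManager streamManager = StreamManager.create(clientConfig)) {
            // Automatic purge based on time: keep roughly the last 7 days of data.
            StreamConfiguration config = StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.fixed(1))
                    .retentionPolicy(RetentionPolicy.byTime(Duration.ofDays(7)))
                    .build();
            streamManager.updateStream("my-project", "sensor-readings", config);

            // Manual trigger: truncate everything before a known position (StreamCut).
            // streamManager.truncateStream("my-project", "sensor-readings", streamCut);
        }
    }
}
```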

Historical data processing: Historical stream processing supports:
- Stream cuts: Set a reading start point.

Apache Flink job management: Authorized users can monitor, start, stop, and restart Apache Flink jobs from the SDP UI. The Apache Flink savepoint feature permits a restarted job to continue processing a stream from where it left off, guaranteeing exactly-once semantics.

Apache Spark job management: Authorized users can monitor, start, stop, and restart Apache Spark jobs from the SDP UI.

Monitoring and reporting: From the SDP UI, administrators can monitor the state of all projects and streams. Other users (project members) can monitor their specific projects.

Dashboard views on SDP UI show recent Pravega ingestion metrics, read and write metrics on streams, and long-term storage metrics.

Heat maps of Pravega streams show segments as they are split and merged, to help with resource allocation decisions.

Stream metrics show throughput, reads and writes per stream, and transactional metrics such as commits and aborts.

Latencies at the segment store host level are available, aggregated over all segment stores.

The following additional UIs are linked from the SDP UI.

Project members can jump directly to the Flink Web UI that shows information about their jobs. The Apache Flink Web UI monitors Flink jobs as they are running.

Project members can jump directly to the Spark Web UI that shows information about their jobs. The Apache Spark Web UI monitors Spark jobs as they are running.

Administrators can jump directly to the Grafana UI with a predefined plug-in for Pravega metrics. Administrators can view Pravega JVM statistics, and examine stream throughputs and latency metrics.

Project members can jump to the project-specific Grafana UI (if the project was deployed with metrics) to see Flink, Spark, and Pravega Search operational metrics.

Project members can jump directly to the Kibana UI from a Pravega Search cluster page.

Logging Kubernetes logging is implemented in all SDP components.

Remote support Secure Remote Services and call home features are supported for SDP. These features require an SRS Gateway server that is configured to monitor the platform. Detected problems are forwarded to Dell Technologies as actionable alerts, and support teams can remotely connect to the platform to help with troubleshooting.

Event reporting Services in SDP collect events and display them in the SDP UI. The UI offers search and filtering on the events, including a way to mark them as acknowledged. In addition, some critical events are forwarded to the SRS Gateway.

SDP Code Hub The SDP Code Hub is a centralized portal to help application developers getting started with SDP applications. Developers can browse and download example applications and code templates, get Pravega connectors, and view demos. Applications and templates from Dell EMC teams include Pravega samples, Flink samples, Spark samples, and API templates. See the Code Hub here.

Schema Registry Schema Registry provides a serving and management layer for storing and retrieving schemas for application metadata. A shared repository of schemas allows applications to flexibly interact with each other and store schemas for Pravega streams.


Basic terminology The following terms are basic to understanding the workflows supported by SDP.

Pravega scope The Pravega concept for a collection of stream names. RBAC for Pravega operates at the scope level.

Pravega stream A durable, elastic, append-only, unbounded sequence of bytes that has good performance and strong consistency. A stream is uniquely identified by the combination of its name and scope. Stream names are unique within their scope.

Pravega event A collection of bytes within a stream. An event has identifying properties, including a routing key, so it can be referenced in applications.

Pravega writer A software application that writes data to a Pravega stream.

Pravega reader A software application that reads data from a Pravega stream. Reader groups support distributed processing.

Flink application An analytic application that uses the Apache Flink API to process one or more streams. Flink applications may also be Pravega Readers and Writers, using the Pravega APIs for reading from and writing to streams.

Flink job Represents an executing Flink application. A job consists of many executing tasks.

Flink task A Flink task is the basic unit of execution. Each task is executed by one thread.

Spark application An analytic application that uses the Apache Spark API to process one or more streams.

Spark job Represents an executing Spark application. A job consists of many executing tasks.

Spark task A Spark task is the basic unit of execution. Each task is executed by one thread.

RDD Resilient Distributed Dataset. The basic abstraction in Spark that represents an immutable, partitioned collection of elements that can be operated on in parallel.

Project An SDP concept. A project defines and isolates resources for a specific analytic purpose, enabling multiple teams of people to work within SDP in separate project environments.

Project member An SDP user with permission to access the resources in a specific project.

Kubernetes environment

The underlying container environment in which all SDP services run. The Kubernetes environment is abstracted from end-user view. Administrators can access the Kubernetes layer for authentication and authorization settings, to research performance, and to troubleshoot application execution.

Schema registry A registry service that manages schemas & codecs. It also stores schema evolution history. Each stream is mapped to a schema group. A schema group consists of schemas & codecs that are associated with applications.

Pravega Search cluster

Resources that process Pravega Search indexing, searches, and continuous queries.

Interfaces SDP includes the following interfaces for developers, administrators, and data analysts.

Table 2. Interfaces in SDP

Interface Purpose

SDP User Interface Configure and manage streams and analytic jobs. Upload analytic applications.

Pravega Grafana custom dashboards Drill into metrics for Pravega.

Apache Flink Web User Interface Drill into Flink job status.

Apache Spark Web User Interface Drill into Spark job status.

Keycloak User Interface Configure security features.

Pravega and Apache Flink APIs Application development.


Project-specific Grafana custom dashboards Drill into metrics for Flink, Spark, and Pravega Search clusters.

Project-specific Kibana Web User Interface Submit Pravega Search queries.

In addition, users may download the Kubernetes CLI (kubectl) for research and troubleshooting for the SDP cluster and its resources. This includes support for the SDP custom resources, such as projects.

User Interface (UI)

The Dell EMC Streaming Data Platform provides the same user interface for all personas interacting with the platform.

The views and actions available to a user depend on that user's RBAC role. For example:

Logins with the admin role see data for all existing streams and projects. In addition, the UI contains buttons that let them create projects, add users to projects, and perform other management tasks. Those options are not visible to other users.

Logins with specific project roles can see their projects and the streams, applications, and other resources that are associated with their projects.

Here is a view of the initial UI window that administrators see when they first log in. Admins see all metrics for all the streams in the platform.

Figure 4. Initial administrator UI after login

Project members (non-admin users) do not see the dashboard. They only see the Analytics and the Pravega tabs for the streams in their projects.

Grafana dashboards

SDP includes the collection, storage, and visualization of detailed metrics.

SDP deploys one or more instances of metrics stacks. One instance is for gathering and visualizing Pravega metrics. Additional project-specific metrics stacks are optionally deployed.

A metrics stack consists of an InfluxDB database and Grafana.

InfluxDB is an open-source database for storing time series data.
Grafana is an open-source metrics visualization tool. Grafana deployments in SDP include predefined dashboards that visualize the collected metrics in InfluxDB.

Developers can create their own custom Grafana dashboards as well, accessing any of the data stored in InfluxDB.


Pravega metrics

In SDP, InfluxDB stores metrics that are reported by Pravega. The Dashboards page on the SDP UI shows some of these metrics. More detail is available on the predefined Grafana dashboards. Administrators can use these dashboards to drill into problems or identify developing memory problems, stream-related inefficiencies, or problems with storage interactions.

The SDP UI Dashboards page contains a link to the Pravega Grafana instance. The Dashboards page and the Pravega Grafana instance are available only to administrators.

Project metrics

An optional Metrics choice is available when a project is created. For a project that has Metrics enabled, SDP deploys a project-specific metrics stack. The InfluxDB collects metrics for Spark applications, Flink clusters, and Pravega Search clusters in the project. Predefined Grafana dashboards exist for visualizing the collected metrics.

The SDP UI project page contains a link to that project's Grafana instance. These instances are available to project members and administrators.

Application specific analytics

For projects that have metrics enabled, developers can add new metrics collections into their applications, and push the metrics to the project-specific InfluxDB instance. Any metric in InfluxDB is available for use on customized Grafana dashboards.
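As one possible approach (SDP does not mandate a particular client library), the following sketch writes a custom measurement point using the influxdb-java client. The endpoint URL, database name, credentials, measurement, and field names are placeholders that depend on how the project-specific metrics stack is configured.

import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;
import java.util.concurrent.TimeUnit;

public class CustomMetricsSketch {
    public static void main(String[] args) {
        // Placeholder endpoint, credentials, and database for the project metrics stack.
        InfluxDB influxDB = InfluxDBFactory.connect(
                "http://project-influxdb:8086", "metrics-user", "metrics-password");
        try {
            influxDB.setDatabase("project_metrics");

            // One application-defined measurement point; once stored, it can be used
            // on customized Grafana dashboards like any other metric in the database.
            influxDB.write(Point.measurement("anomaly_detector")
                    .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                    .tag("stream", "sensor-readings")
                    .addField("anomalies_detected", 3L)
                    .build());
        } finally {
            influxDB.close();
        }
    }
}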

Apache Flink Web UI

The Apache Flink Web UI shows details about the status of Flink jobs and tasks. This UI helps developers and administrators to verify Flink application health and troubleshoot running applications.

The SDP UI contains direct links to the Apache Flink Web UI in two locations:

From the Analytics Project page, go to a project and then click a Flink Cluster name. The name is a link to the Flink Web UI which opens in a new browser tab. It displays the Overview screen for the Flink cluster you clicked. From here, you can drill into status for all jobs and tasks.

Figure 5. Apache Flink Web UI

From the Analytics Project page, go to a project and then click a Flink Cluster name. Continue browsing the applications running in the cluster. On an application page, a Flink sub-tab opens the Apache Flink UI. That UI shows the running Flink Jobs in the application.


Apache Spark Web UI

The Apache Spark Web UI shows details about the status of Spark jobs and tasks. This UI helps developers and administrators to verify Spark application health and troubleshoot running applications.

The SDP UI contains direct links to the Apache Spark Web UI. From the Analytics Project page, go to a project, click Spark, and click a Spark application name. The name is a link to the Spark Web UI which opens in a new browser tab. It displays the Overview screen for the Spark application you selected. From here, you can drill into status for all jobs and tasks.

Figure 6. Apache Spark Web UI

APIs

The following developer resources are included in an SDP distribution.

SDP includes these application programming interfaces (APIs):

Pravega APIs, required to create the following Pravega applications:
Writer applications, which write stream data into the Pravega store.
Reader applications, which read stream data from the Pravega store.
Apache Flink APIs, used to create applications that process stream data.
Apache Spark APIs, used to create applications that process stream data.
PSearch APIs, used to register continuous queries or process searches against the stream.
Schema Registry APIs, used to retrieve and perform schema registry operations.

Stream processing applications typically use these APIs to read data from Pravega, process or analyze the data, and perhaps even create new streams that require writing into Pravega.
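To make the writer and reader roles concrete, the following minimal sketch uses the open-source Pravega Java client to write one event with a routing key and read it back through a reader group. The controller URI, scope, stream, and reader group names are placeholders, and the sketch assumes the scope and stream already exist; in SDP, applications also authenticate through the Keycloak client dependency described later in this guide.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;
import java.net.URI;

public class ReadWriteSketch {
    public static void main(String[] args) {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller:9090"))
                .build();
        String scope = "example-scope";
        String stream = "example-stream";

        // Writer: the routing key groups related events onto the same stream segment.
        try (EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, config);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     stream, new UTF8StringSerializer(), EventWriterConfig.builder().build())) {
            writer.writeEvent("sensor-42", "temperature=21.5");
        }

        // Reader: readers join a reader group, which coordinates distributed consumption.
        try (ReaderGroupManager rgm = ReaderGroupManager.withScope(scope, config);
             EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, config)) {
            rgm.createReaderGroup("example-group", ReaderGroupConfig.builder()
                    .stream(Stream.of(scope, stream))
                    .build());
            try (EventStreamReader<String> reader = factory.createReader(
                    "reader-1", "example-group",
                    new UTF8StringSerializer(), ReaderConfig.builder().build())) {
                String event = reader.readNextEvent(2000).getEvent();  // 2-second timeout
                System.out.println("Read: " + event);
            }
        }
    }
}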

What you get with SDP The SDP distribution includes the following software, integrated into a single platform.

Kubernetes environments
Dell EMC Streaming Data Platform management plane software
Keycloak software and an integrated security model
Pravega data store and API
Schema registry for managing schemas and codecs
Pravega Search (PSearch) framework, query processors, and APIs
Apache Flink framework, processing engine, and APIs
Apache Spark framework, processing engine, and APIs
InfluxDB for storing metrics
Grafana UI for presenting metrics
Kibana UI for presenting metrics
SDP installer, scripts, and other tools

Use case examples Following are some examples of streaming data use cases that Dell EMC Streaming Data Platform is especially designed to process.

Industrial IoT Detect anomalies and generate alerts. Collect operational data, analyze the data, and present results to real-time dashboards and trend analysis reporting. Monitor infrastructure sensors for abnormal readings that can indicate faults, such as vibrations or high temperatures, and recommend proactive maintenance. Collect real-time conditions for later analysis. For example, determine optimal wind turbine placement by collecting weather data from multiple test sites and analyzing comparisons.

Streaming Video Store and analyze streaming video from drones in real time. Conduct security surveillance. Serve on-demand video.

Automotive Process data from automotive sensors to support predictive maintenance. Detect and report on hazardous driving conditions that are based on location and weather. Provide logistics and routing services.

Financial Monitor for suspicious sequences of transactions and issue alerts. Monitor transactions for legal compliance in real-time data pipelines. Ingest transaction logs from market exchanges and analyze for real-time market trends.

Healthcare Ingest and save data from health monitors and sensors. Feed dashboards and trigger alerts for patient anomalies.

High-speed events

Collect and analyze IoT sensor messages. Collect and analyze Web events. Collect and analyze logfile event messages.

Batch applications

Batch applications that collect and analyze data are supported.

Documentation resources Use these resources for additional information.

Table 3. SDP documentation set

Subject Reference

Dell EMC Streaming Data Platform documentation

Dell EMC Streaming Data Platform Documentation InfoHub: https://www.dell.com/support/article/us/en/19/sln319974/ dell-emc-streaming-data-platform-infohub

Dell EMC Streaming Data Platform support site: https://www.dell.com/support/home/us/en/04/ product-support/product/streaming-data-platform/overview

(This guide) Dell EMC Streaming Data Platform Developer's Guide at https://dl.dell.com/content/docu103272

Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271


Dell EMC Streaming Data Platform Security Configuration Guide at https://dl.dell.com/content/docu103273

Dell EMC Streaming Data Platform Release Notes 1.2 at https:// dl.dell.com/content/docu103274

NOTE: You must log onto a Dell support account to access release notes.

SDP Code Hub Community-supported public Github portal for developers and integrators. The SDP Code Hub includes Pravega connectors, demos, sample applications, API templates, and Pravega and Flink examples from the open-source SDP developer community: https:// streamingdataplatform.github.io/code-hub/

Pravega concepts, architecture, use cases, and Pravega API documentation

Pravega open-source project documentation:

http://www.pravega.io

Apache Flink concepts, tutorials, guidelines, and Apache Flink API documentation

Apache Flink open-source project documentation:

https://flink.apache.org/

Apache Spark concepts, tutorials, guidelines, and Apache Spark API documentation

Apache Spark open-source project documentation:

https://spark.apache.org

https://github.com/StreamingDataPlatform/workshop-samples/tree/ master/spark-examples


Using the Streaming Data Platform (SDP)

Topics:

About installation and administration
User access roles
About project isolation
Log on as a new user
Change your password or other profile attributes
Overview of the project-member user interface
Analytics projects and Pravega scopes views
Create analytics projects and clusters
About stream processing analytics engines and Pravega connectors

About installation and administration Your platform administrator installs and administers all components of the Streaming Data Platform.

Distributed data processing frameworks like Apache Flink or Apache Spark need to be set up to interact with several components, such as resource managers, file systems, and services for distributed coordination. Your platform administrator might install these processing engines on more than one node for additional processing power. Each installation of Flink or Spark is licensed and tracked, and Dell EMC can help you determine how many licenses to include in your configuration.

For information about installation and administering Flink and Spark clusters and Pravega streams, see the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271

User access roles The SDP provides one user interface (UI) for all users interacting with stream analytics. Authorization roles control the views and actions available to each user login.

For example:

Platform administrator roles have authorization to create analytics projects, add users to projects, and manage storage resources. Those options are not visible to logins with a user role.

Project member roles are assigned to projects by platform administrators as follows:

Project members see only the streams and applications that are associated with their projects. Users assigned to a project member role are typically developers or data analysts.

Project members can view project-related streams, create and update Flink or Spark clusters, and upload applications and artifacts.

Project members must be added to the platform and to analytics projects by administrators. For more information, see the section on adding new members in the Dell EMC Streaming Data Platform Installation and Administration Guide.

The SDP supports the following types of activities for administrators, developers, and data analysts.

Role Persona Activities

Administrator Security Administrator: Configure logins; view security-related logs.

Administrator Platform Administrator: Create projects and add users to projects; monitor system throughput, storage, and resource allocation.

Project-member Developer: Add streams and applications to projects; define streams and configure stream properties; create and update Flink clusters; upload production-ready application images, applications, and associated artifacts; start, stop, and troubleshoot application jobs.

Project-member Data Analyst: Work within a project; start and stop applications; add members to existing scopes and streams; monitor application job status; research stream metrics.

Platform administrators have access to all projects by default, and projects are created and managed by administrators. Administrators ensure that a project's members include the set of users who need access to the project. Administrators monitor resources associated with stream storage and application processing. They may also monitor stream ingestion and scaling.

A project-member is usually a developer or data analyst.

Developers typically upload their application artifacts, choose or create the required streams associated with the project, and run and monitor applications. They may monitor streams as well.

Data analysts may run and monitor applications. They may also monitor or analyze metrics for the streams in the project.

About project isolation Groups of users with differing functions work within a Streaming Data Platform project. A project defines and isolates resources for a specific analytics purpose. Each project has a dedicated Kubernetes namespace, which enhances project isolation.

A project's resources are protected using RBAC by the following:

Project resources are protected at the Keycloak level. For example, access to the Pravega scope and its streams, and to Maven artifacts.

Project resources are protected at the Kubernetes level. For example, any namespace resources in the Kubernetes cluster for this project, such as FlinkCluster, FlinkApplication, and Pods.

For additional information about RBAC and security, see the Dell EMC Streaming Data Platform Security Configuration Guide at https://dl.dell.com/content/docu103273 and Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.

Log on as a new user You'll need a user account configured by your platform administrator to log on to the SDP. Ask your administrator for the URL.

Steps

1. Type the URL of the SDP UI in a web browser.

NOTE: Obtain the URL from your Platform Administrator.


The Streaming Data Platform login window displays.

2. Using one of the following credential sets, provide a Username and Password.

If your platform administrator provided credentials, use those. For environments where an existing Identity Provider is integrated, use your enterprise credentials.

3. Click Log In.

If your user name and password are valid, you are authenticated to SDP. Depending on your authorization, you see one of the following windows:

If you see this window Explanation

The Dashboard section of the SDP UI. The dashboard appears if the username has authorizations associated with it. It shows streams and metrics that you are authorized to view.

A welcome message asking you to see your Administrator.

The welcome message appears if there are no authorizations associated with the username. You are authenticated but not authorized to see any data. Ask an Administrator to provide authorizations to your username.

4. If you see the welcome message directing you to an Administrator, ask for one of the following authorizations:

Purpose Authorization Request

View and administer all projects and streams. Platform Administrator

View or develop applications or streams within a project. Project member

View or develop applications or streams within a project that does not yet exist.

A Platform Administrator must create the new project and add you as a member.


Purpose Authorization Request

View or develop streams in a scope that is independent of any project and does not yet exist.

A Platform Administrator must create the new scope and add you as a member.

Change your password or other profile attributes A user can change the password or other profile attributes for the username that they are currently logged in under.

Steps

1. Log in to the SDP using the username whose password or other profile attributes you want to change.

NOTE: When federation is enabled, password changes must be made in the federated identity provider. The federated credentials are used for authentication for all operations in the SDP and kubectl. In order to use basic authentication with Gradle to publish artifacts to a project's Maven repository, you must set a local password in Keycloak for the local/shadow user that was created when the federated user logged into the SDP.

2. In the top banner, click the User icon.

3. Verify that the username at the top of the menu is the username whose profile you want to change.

4. Choose Edit Account.

5. To change the password, complete the password-related fields.

6. Edit other fields in the profile if needed.

7. Click Save.

8. To verify a password change, log out and log back in using the new password. To log out, click the User icon and choose Logout.

Overview of the project-member user interface This section summarizes the top-level screens and menus in the SDP UI available to users in the project-member role.

The top banner contains icons for navigating to the portions of the UI that a project-member has access to.

Icon Description

Analytics Displays all the projects that include the login account as a project-member. You can: View summary information about all the analytics projects assigned to you. Drill into a project page. NOTE: Only platform administrators can create new projects.

For platform administrator information, see Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271

Pravega Displays all the scopes that your login account has permission to access. You can: View summary information about the scopes assigned to you. Drill into a scope page, and from there, into streams. Create new streams. Create new scopes, if you are a platform administrator.

User The User icon displays a drop-down menu with these options:
The user name that you used to log in to the current session. (This item is not clickable.)
Product Support: Opens the SDP documentation InfoHub. From there, you can access Dell EMC product documentation and support.
Logout.


Analytics projects and Pravega scopes views The Analytics Projects and Pravega Scopes views are available to users in the project-member role.

View projects and the project dashboard

Log into the SDP and click the Analytics icon to navigate to the Analytics Projects view. This lists the projects you have access to.

Navigate using the links at the top of the project dashboard to view Apache Flink clusters, applications, and artifacts associated with the project. Later sections of this guide provide information about how to create Flink clusters, create applications, and upload artifacts.

Users with the project-member credentials can view the following:

Project-members (non-administrative users) can only see the projects assigned to them.
Project-members cannot create analytics projects. Only users with platform administrator credentials have the rights to create analytics projects.
Project-members have permissions to create and update Flink clusters, Flink and Spark applications, artifacts, Pravega Search (PSearch) clusters, and Schema Registry components, such as schema groups, schemas, continuous queries (CQs), and codecs assigned to their projects.

Click a project name to view the Dashboard for that project.

Figure 7. Project Dashboard

The page header includes information about the project:

Created Date
State
Description
If your system is configured for Dell EMC ECS storage, the project's system-generated bucket name and access credential display.

Click the tabs along the top of the dashboard to drill down into the project and view the following:

Link name Description

Flink Clusters View the configured Flink clusters.

Flink Apps View the applications defined in the project. Project-members can use this tab to create new applications, remove applications, edit application configurations, and update artifacts.

Spark Apps View the applications defined in the project. Project-members can use this tab to create new applications, remove applications, edit application configurations, and update artifacts.

Artifacts View the application artifacts uploaded to the project's Maven repository.

Project-members can upload, update, and remove artifacts.

Members Add members.

Pravega Search Search the configured clusters, searchable streams, and continuous queries.

Additionally, the project dashboard displays the following:

Number of task slots used/number of task slots available
Number of streams associated with the project
Number of jobs running, finished, cancelled, and failed
Kubernetes event messages related to the project in the message section

Pravega scopes view

A Pravega scope is a collection of projects and associated streams. RBAC authentication for Pravega operates at the scope level.

To navigate to the Pravega Scopes view, log on to the SDP and click the Pravega icon in the top banner.

Project-members (non-administrative users) have access to the following:

Project-members can view project-related scopes. Project-members cannot create new Pravega scopes or view any scopes that are not assigned to their projects. Project-members cannot create scopes outside of a project or manage scope membership in the SDP UI.

For more information about managing Pravega scopes, see the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.

Pravega streams view

A stream is uniquely identified by the combination of its name and scope. The Pravega streams view lists the streams available within a specific project scope. Project-members typically create the data streams associated with a project. They may also monitor those streams.

To view a project's Pravega streams, log on to the SDP and click the Pravega icon in the top banner. Click the name of a Project Scope. From here, you can view streams and add/view schema registry groups and codecs. For more information, see the "About Pravega schema registry" section in this guide.

NOTE: You can also access the details for a single stream from the Pravega > Pravega Streams view.

View the following details about the stream:
Name of the stream
Scaling policy
Retention policy
Bytes written
Bytes read
Transactions open, created, or aborted
Stream segment count
Segment heat chart

For more information about streams, see the Working with Pravega streams section.

The views and actions available to a user in the Dell EMC Streaming Data Platform depend on a user's RBAC role:


Users in the platform administrator role can see data for all existing streams and projects. In addition, the UI displays icons that allow admins to create projects, add users to projects, and create Pravega scopes. Those options are not visible to project-members.

Users in the project-member role, for example developers and data analysts, can see the projects and streams, applications, and other resources that are associated with their projects.

Create analytics projects and clusters Only platform administrators can create analytics projects in the Streaming Data Platform. Both platform administrators and project-members can create Flink clusters.

Unlike Flink, because Spark defines its cluster in the application resource, there is no need to create a Spark cluster before deploying the application. A Spark cluster consists of a Spark operator, driver, and executor pods.

About analytics projects

A SDP analytics project is a Kubernetes custom resource of kind project. A project resource is a Kubernetes namespace enhanced with resources and services. An analytics project consists of Flink clusters, applications, artifacts, Pravega Search (PSearch) clusters, and project members.

Only platform administrators can create projects and add members to projects. Project members can see information about the projects that they belong to.

Platform administrators can view all projects and create new projects in the UI. If more control of project definition is required, administrators can create projects manually using kubectl commands. For more information, see the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.

All analytic capabilities are contained within projects. If enabled upon project creation by platform administrators, analytic metrics are enabled for a project.

Projects provide support for multiple teams working on the same platform, while isolating each team's resources from the others. Project members can collaborate in a secure way, as resources for each project are managed separately.

When an analytics project is created through the Dell EMC Streaming Data Platform UI, the platform automatically creates the following:

Resources and services Function

Maven repository Hosts job artifacts

Zookeeper cluster Allows fault tolerant Flink clusters

Project storage A persistent volume claim (PVC) that provides shared storage between Flink clusters

Kubernetes namespace Houses the project resources and services

Keycloak credentials Configured to allow authorized access to the project Pravega scope and analytics jobs to communicate with Pravega

Pravega scope A Pravega scope is created with the same name as the project to contain all the project streams. Scopes created this way automatically appear in the list of scopes. Keycloak credentials allow access to the Pravega scope

Security Security features allow applications belonging to the project to access the Pravega streams within the project scope. For more information about security features, see the Dell EMC Streaming Data Platform Security Configuration Guide at https://dl.dell.com/content/docu103273 and the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.


About Flink clusters

Flink clusters provide the compute capabilities for analytics applications. Using the SDP UI, one or more fault-tolerant Flink clusters can be deployed to a project.

Both platform administrators and project-members can deploy Flink clusters.

Flink clusters can be deployed into a project as follows:

Administrators and project members can deploy Flink clusters using the SDP UI.
If customized configurations of compute resources are required, you can deploy clusters manually using Kubernetes kubectl commands. To use kubectl commands, project-members must be federated users. For more information, see the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.

Components of a Flink setup

Distributed data processing frameworks like Apache Flink need to be set up to interact with several components, such as resource managers, file systems, and services for distributed coordination. Platform administrators manage these components using the SDP UI.

Apache Flink consists of components that work together to execute streaming applications. For example, the job manager and task managers have the following functions:

Job Manager

The job manager is the master process that controls the execution of a single application. An application consists of a logical data flow graph and a JAR file that bundles all the required classes, libraries, and other resources.

The job manager requests the necessary resources (Task Manager slots) to execute the tasks from the Resource Manager. Once it receives enough task manager slots, it distributes the tasks to the task managers that execute them. During execution, the job manager is responsible for all actions that require central coordination, such as the coordination of checkpoints.

Task Managers

Task managers are the worker processes of Flink. Typically, there are multiple Task Managers running in a Flink setup. Each task manager provides a certain number of slots. The number of slots limits the number of tasks a task manager can execute. When instructed by the Resource Manager, a task manager offers one or more of its slots to the job manager. The job manager can then assign tasks to the slots to execute them.

About Flink jobs

Analytics applications define the parameters and required environment for a deployment.

For Apache Flink jobs to be deployed, they must be scheduled on Flink clusters as standard Flink jobs. Each start and stop of an analytics application results in a new Flink job being deployed to a cluster.

Create Flink clusters

After a platform administrator creates an analytics project, project-members can create Flink clusters using the SDP UI.

Steps

1. Determine if the cluster requires a custom configuration. The SDP uses default values for most configuration attributes when creating Flink clusters. If a custom configuration is required, a project-member can create a Flink cluster using the Kubernetes command line.

Custom configurations are required in the following situations:


Custom Flink images for processing

Developers may want to process their applications using a custom Flink image. SDP ships with images for 1.8.2, 1.9.1, 1.9.2, and 1.10.0. Any other image would be considered custom. Dell EMC recommends that you use the latest Apache Flink version available (1.10.0 at the time of this publication) when creating new SDP Flink applications.

NOTE: SDP v1.1 ships with a Flink 1.9.0 image, but this should be considered deprecated and is distributed for older, already-deployed applications.

The supported versions are prebuilt Apache Flink images that are included with the SDP. The Flink binaries in the images are enterprise versions of Flink provided by our enterprise Flink business partner.

Custom labels for scheduling

Developers may want to use Flink cluster labels to assign jobs to specific Flink clusters for processing. The SDP uses the Flink image version number to assign the cluster to a job. For more flexibility, developers can incorporate labels.

NOTE: Registering and using custom Flink images is supported by creating a ClusterFlinkImage resource. To view the default cluster Flink images installed, run the following Kubernetes command: kubectl get clusterflinkimages

2. To create a Flink cluster using the Dell EMC Streaming Data Platform UI, log on and navigate to Analytics -> project-name > Flink Clusters.

3. Click Create Flink Cluster.

4. Complete the cluster configuration screen. The following attributes are configurable:

Section Attribute Description

General Name Labels are optional. The label name must conform to Kubernetes naming conventions.


Label Enter a custom label key name and value.

Flink Image Choose the Flink image for this cluster. The SDP presents options taken from the registered ClusterFlinkImage resources.

Task Manager Number of Replicas You can configure multiple task managers. Enter the number of Apache Flink task managers in the cluster in the Replicas field.

NOTE: A standalone Apache Flink cluster consists of at least one Job Manager (the master process) and one or more task managers (worker processes) that run on one or more machines.

Number of Task Slots Enter the number of task slots per task manager (at least one). Each task manager is a JVM process that can execute one or more subtasks in separate threads. The number of task slots controls how many tasks a task manager accepts.

NOTE: The task slots represent the equal dividing-up of the resources a task manager provides. For example, if a task manager was configured with 2 CPU cores and 2048 MB of memory, then each task slot would get roughly 1 CPU and 1024 MB of heap. For more information, see the Apache Flink documentation: https://ci.apache.org/projects/flink/flink-docs-stable/concepts/runtime.html#task-slots-and-resources.

Memory Specify the memory allocation. The default value is 1024 MB.

CPU Enter the number of cores. The default value is 0.5 cores. Dell EMC recommends 1 CPU per task slot.

Local Storage Volume Size Specify the Storage Type as either Ephemeral or Volume Based.

Ephemeral temporary storage is provided automatically to your instance by the host node. This is the preferred storage option for task managers. Every time the task manager starts, it starts with clean temporary storage.

Volume-based storage is provided by persistent storage PVCs. This storage is not overwritten/wiped when the task manager restarts, and is always available, regardless of the state of the running instance.

For more information about Apache Flink task managers and task slots, see the Apache Flink documentation: https://ci.apache.org/projects/flink/flink-docs-stable/concepts/runtime.html

5. Click Save.

In the Flink Cluster view that appears, the cluster State is initially Deploying. After a few seconds, the State changes to Ready. The cluster is now ready for project-members to create applications and upload artifacts.

Change Flink cluster attributes

Project-members can edit Flink clusters to change the number of Apache Flink task manager replicas.

Steps

1. Log on to the SDP.

2. Navigate to Analytics > project-name > Flink Clusters.

3. The Edit Flink Cluster page appears. Change the number of task manager replicas and then click Save.


Delete a Flink cluster

You can delete a Flink cluster and define a new cluster within a project without having to remove any applications.

Prerequisites

Before you delete a Flink cluster, check for the impact on applications that are currently running on the cluster. If you delete a Flink cluster that has applications deployed and running on it, then the SDP first safely stops all Flink applications (the Flink cluster goes into a "draining" state while this is happening). Once all Flink applications have been stopped, the cluster is deleted. In the meantime, any Flink application that was ejected from the Flink cluster immediately goes into a Scheduling state, while searching for a new cluster to deploy onto. Once one is found, it deploys and continues executing from where it left off.

About this task

Deleting a cluster does not delete any underlying Flink applications associated with that cluster.

Steps

1. Log on to the SDP.

2. Navigate to Analytics -> project-name.

3. Click the Flink Clusters tab.

4. Identify the cluster to delete and then click Delete under the Action column. A popup menu displays two buttons: Delete and Cancel.

5. Confirm that you want to delete the cluster by clicking Delete. Click Cancel to keep the cluster.

About stream processing analytics engines and Pravega connectors Specialized connectors for Pravega enable applications to perform read and write operations over Pravega streams. Streaming Data Platform includes two built-in analytics engines for Pravega, the Apache Flink and Apache Spark connectors, which are integrated into the SDP UI.

Because Pravega is open-source software, you may develop your own connectors and contribute them to the Pravega community on the SDP Code Hub GitHub site. All connectors are reviewed and approved by Dell EMC.

For more information about developing Pravega connectors, see http://www.pravega.io and the SDP code hub: https:// streamingdataplatform.github.io/code-hub/.

About the Apache Flink integration

Apache Flink job management for Pravega is an integrated component in the SDP UI. Flink allows users to build an end-to-end stream processing pipeline that can read and write Pravega streams using the Flink stream processing framework. By combining the features of Apache Flink and Pravega, it is possible to build a pipeline of multiple Flink applications that can be chained together to give end-to-end exactly-once guarantees across the chain of applications.

Apache Flink is a distributed stream processor with intuitive and expressive APIs to implement stateful stream processing applications. The Flink integration into SDP:

Ensures parallel reads and writes
Exposes Pravega streams as Table API records
Allows checkpointing
Guarantees exactly-once processing with Pravega
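The following minimal sketch outlines how a Flink job might consume a Pravega stream through the open-source Flink connector for Pravega. The builder methods shown follow the connector's public API but can vary between connector versions, and the controller URI, scope, and stream names are placeholders.

import io.pravega.connectors.flink.FlinkPravegaReader;
import io.pravega.connectors.flink.PravegaConfig;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.net.URI;

public class FlinkPravegaSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder controller URI and scope; SDP supplies these for project applications.
        PravegaConfig pravegaConfig = PravegaConfig.fromDefaults()
                .withControllerURI(URI.create("tcp://pravega-controller:9090"))
                .withDefaultScope("example-scope");

        // Source that reads string events from a Pravega stream in parallel.
        FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
                .withPravegaConfig(pravegaConfig)
                .forStream("example-stream")
                .withDeserializationSchema(new SimpleStringSchema())
                .build();

        env.addSource(source).print();

        env.execute("flink-pravega-sketch");
    }
}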

About the Apache Spark integration

Apache Spark job management for Pravega is an integrated component in the Streaming Data Platform.

The Apache Spark integration allows users to manage applications and view Spark analytic metrics from within the SDP UI.


The integration includes support for Java, Scala, and Python (pySpark) and Spark 2.4 and 3.0.1.

To learn about developing stream processing applications using Apache Spark, see the Apache Spark open-source documentation at https://spark.apache.org.


Working with Apache Flink Applications

This section describes using Apache Flink applications with the SDP.

Topics:

About hosting artifacts in Maven repositories
About Apache Flink applications
Application lifecycle and scheduling
View status and delete applications

About hosting artifacts in Maven repositories Each analytics project has a local Maven repository that is deployed within its namespace for hosting application artifacts that are defined in the project.

Apache Flink and Spark look at the Maven repository to resolve artifacts when launching an application. The SDP also supports external Enterprise Repositories, allowing the SDP to pull artifacts from Maven repositories external to SDP.

Project-members can upload/download applications and artifacts to the Maven repository for the project(s) they have been provided access to using the SDP UI or through Maven/Gradle using HTTP basic authentication.

Platform administrators can upload/download applications and artifacts across all projects and Maven repositories.

The Maven repository API is protected using HTTP basic authentication. Both Maven and Gradle can be configured to supply basic authentication credentials when accessing the API. For more information, see the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271

Applications must be built with the following runtime dependency, which is the library that allows authentication to Keycloak.

Repository: https://oss.jfrog.org/artifactory/jfrog-dependencies

Version: 0.6.0-6.1a4bae7-SNAPSHOT

Coordinates: io.pravega : pravega-keycloak-client : ${pravega.keycloak.version}

About artifacts for Flink projects

Artifacts can be of two types: Maven or File.

Before applications can be deployed to a cluster, the JAR files must be uploaded. Upload the files by directly publishing using the Maven CLI mvn tool or by uploading the files using the SDP UI. There are two types of repositories:

Local repositories can be internal repositories set up on a file or HTTP server within your company. These repositories are used to share private artifacts between development teams and for releases.

Remote repositories refer to any other type of repository accessed by various protocols, such as file:// and http://. These repositories might be a remote repository set up by a third party to provide their artifacts for downloading. For example, repo.maven.apache.org is the location of the Maven central repository.

Local and remote repositories should be structured similarly so that scripts can run on either side and can be synchronized for offline use. The layout of the repositories should be transparent to the user.



NOTE: A user in the platform administrator role configures the volume size during creation of an analytics project. The Maven volume size is the anticipated space requirement for storing application artifacts that are associated with the project. The SDP provisions this space in the configured vSAN.

For more information, see the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/ content/docu103271

Upload Flink application artifacts

Perform the following steps to upload Flink application artifacts using the SDP UI.

Steps

1. Log on to the SDP.

2. Navigate to Analytics > project-name and click the Artifacts tab.

3. Select either Maven or File to view those artifacts.

4. Click Upload Artifact to upload a Maven artifact or Upload File to upload a file.

The upload button that is displayed depends on whether you chose the Maven or Files button.

Figure 8. Upload artifact

5. Click Upload Artifact to upload the artifact.

Deploy new Flink applications

Perform the following steps to deploy a Flink application using the SDP UI.

Steps

1. Log in to the SDP.

2. Go to Analytics > Analytics Projects > Flink > Apps > project-name > Create New App.


In the form that appears, specify the name for the application, the artifact source and main class, and the following configuration information:

Field Description

General Provide the name for the application.

Source Source selections:
Maven: Use the Artifact drop-down to find the artifact version. If you do not see the version, click the Upload link and upload a new version of your code. An Upload Artifact form is displayed with choices for the Maven Group, Maven Artifact, Version, and a browse button to the JAR file. To change the version of the artifact, select a different version in the Artifacts drop-down menu.
File: Select the Main Application File from the drop-down menu. The selection can be a URL, image path, or other location. Click the Upload link to upload an Artifact File, using a path in the repository. Browse to choose any type of file to upload other than JAR.
Custom Path: The Main Application File can be a URL, image path, or other location.
Main Class: This choice is optional. This class is the program entry point. This class is only needed if it is not specified in the JAR file manifest.

Configuration Parallelism: Enter the number of tasks to be run in parallel.
NOTE: This number is the Default parallelism if it has not been explicitly set in the application.
Flink Version: Choose the version of the Flink cluster where the application is to run.
Add Property: Associate key-value pairs.
Add Stream: Enter a key for the values to be extracted from a specified Pravega scope and stream. You can specify an existing stream or create a new stream.

3. If you do not see the version of the artifact in the Artifact drop-down, click the Upload link to upload a new version of your application. This choice takes you to the Upload Artifact view. When the upload is complete, you can change the version of the artifact by selecting a different version in the Artifacts drop-down menu.

4. Specify a Main Class if you did not specify one in the JAR manifest file.

5. Under the Configuration section, under Parallelism, enter the number of tasks that run in parallel.

6. Click Add Property to provide additional properties, such as key pairs. For example, the Key might be City and the Value might be the name of a city.

7. Click Add Stream to add a new stream to the analytics application. Enter your stream name, segment scaling, and retention policy choices in the Create Stream form. For details about completing this form, see the stream configuration attributes section of this guide.

8. Click Save to save your stream settings. This choice creates a new stream within your project and takes you back to the Create New App form.

9. Ensure that all the sections of the form have been completed as required and then click Create.

10. You are returned to the Analytics Project screen, where you can view the status of the deployed application in the properties, savepoints, events, and logs views. NOTE: While the application is running, you cannot start or stop it; however, you have the option under the Actions column to Edit or Delete it.

View and delete artifacts

Perform the following procedure to view and delete project artifacts.

Steps

1. Log on to the SDP.

2. Navigate to Analytics > project-name and click the Artifacts tab. The Artifacts view is displayed.

3. Click the Maven or the Files button to display artifacts or files.

4. Under the Actions column, click Delete to delete the artifact or file from the project.

5. Click Download to download a copy.

Connect to the Maven repository using the mvn CLI

You can connect to the project's Maven repository by running Maven's mvn CLI tool as described in this section.

Steps

1. Connect to the Maven repository using mvn. Note that Maven must be installed locally.

2. Edit the ~/.m2/settings.xml file to include a server section and enter your username and password.

3. Edit the project's pom.xml file to include a distributionManagement repository section. The section must contain both a hostname and a port.

4. Publish your project using mvn deploy.

5. Provide the following:

Parameter Description

Group ID This is usually a reverse domain name. For example, com.dellemc


Artifact ID Name of the artifact file

Version Version number of the artifact file

Packaging For example, a JAR file

Name of the file to be uploaded Include the path if required

Repository ID The repository ID should match the Server ID that you set in the Maven settings

Using Gradle

To configure a Gradle project to publish to a Maven repository using HTTP Basic authentication, you must first add a publishing section to your build.gradle file.


The standard maven-publish Gradle plugin must be added to the project in addition to a credentials section. For example:

build.gradle

apply plugin: "maven-publish"

publishing {
    repositories {
        maven {
            url = "http:// "
            credentials {
                username "project-member-uid"
                password "password"
            }
            authentication {
                basic(BasicAuthentication)
            }
        }
    }

    publications {
        maven(MavenPublication) {
            groupId = 'com.dell.com'
            artifactId = 'anomaly-detection'
            version = '1.0'

            from components.java
        }
    }
}

Once defined, publish the project by running something similar to the following Gradle command:

$ ./gradlew :anomaly-detection:publishMavenPublicationToMavenRepo

About Apache Flink applications Stream processors, such as Apache Flink, support event-time processing to produce correct and deterministic results with the ability to process large amounts of data in real time.

These analytics engines are embedded in the SDP, providing a seamless way for development teams to deploy analytics applications.

The SDP provides a managed environment for running Apache Flink Big Data analytics applications, providing:


Consistent deployment
A UI to stop and start deployments
An easy way to upgrade clusters and applications

NOTE: This guide explicitly does not go into how to write streaming applications. To learn about developing stream processing applications using Apache Flink or Apache Spark, see the following:

Software Link

SDP Code Hub code samples, demos, and connectors For a description, see the section in this guide: SDP Code Hub streaming data examples and demos

Apache Flink open-source documentation https://flink.apache.org

Ververica Platform documentation https://www.ververica.com/

Pravega concepts documentation and Pravega workshop examples

http://pravega.io/docs/latest/pravega-concepts/

https://streamingdataplatform.github.io/code-hub/

Application lifecycle and scheduling A streaming data application flows through several states during its lifecycle.

Apache Flink life cycle

Once a streaming data application is created using the SDP UI, the Apache Flink stream processing framework manages it.

This framework determines the state of the application and what action to take. For example, the system can either deploy the application or stop a running application.

Figure 9. Apache Flink streaming data application lifecycle

Lifecycle Description

Scheduling Scheduling is the process of locating a running Flink cluster on which to deploy the application as a Flink job. If multiple potential target clusters are found, then one of the clusters is chosen at random. The application remains in a scheduling state until a target cluster is identified.

The scheduler uses two pieces of information to match an application to a cluster:

1. Flink Version: For a cluster to be considered a deployment target for an application, the Flink versions must match. The flinkVersion specified in the FlinkApplication descriptor and the flinkVersion in the image that the FlinkCluster is based on must be the same.

2. Labels: The Flink version is enough to match an application with a cluster; however, labels can also be used to provide more precise cluster targeting. Standard Kubernetes resource labels can be specified on the Flink resource. An optional clusterSelector label can be specified within an application descriptor to target a cluster.


Starting Starting is an important state indicating that the application has been deployed on the Flink cluster but not yet fully running. Flink is still scheduling tasks.

Started Once a target cluster has been identified, the application is deployed to the target cluster. The process requires two steps and can take several minutes as follows:

1. The Maven artifact is first resolved against the Maven repository to retrieve the JAR file.

2. The application is then deployed to the target Flink cluster.

NOTE: If an application is stopped, the Flink job is started from the recoveryStrategy.

Savepointing If an application that has been started is modified, the framework responds by first stopping the corresponding Flink job. What happens after a running Flink job is stopped depends upon the stateStrategy.

A Flink job that has state is canceled with a savepoint.

When an application is stateless, there is no state to save.

The None state can be used in situations where the application is stateless.

Stopped Once savepointing is completed, the application is returned to a stopped state. The state field in the application specification determines what happens next.

If the state is stopped, then the application remains in a stopped state until the specification is updated.

If the state is started, the application returns to the scheduling state and the deployment process restarts.

Error If the application is placed in an error state, view details about the error in the SDP web interface or the application resource. For more information about debugging applications in an error state, see the section in this guide Troubleshooting deployments.


View status and delete applications After you have deployed Flink clusters and applications, you can view their properties or delete them.

View and edit Flink applications

Log in to the SDP and go to Analytics > project-name. Click the Apps tab. You can also select Flink Apps from the Dashboard.

View and edit the following:

Application name
State of the application, for example, Scheduling or Stopped
Name of the Flink cluster the application is running on
Cluster selectors, such as custom labels
Flink version
The time and date the application was created
Available actions that you can perform on the application, including Start, Stop, Edit, and Delete

View application properties

Log on to the SDP and navigate to Analytics > project-name. Click the Properties tab.

Navigate the blue links at the top to view events, logs, the Flink UI, and the following:

Category Description

Source The Main class from the JAR manifest

Configuration Flink version, Cluster Selectors (such as custom labels), Job ID, Parallelism, and Parameters


View application savepoints

Log on to the SDP and navigate to Analytics > project-name. Click the Savepoints tab.

Navigate the blue links at the top to view application properties, events, logs, view running applications in the Flink UI, and the following:

Name of the savepoint
Path to the savepoint
When the savepoint was created
State of the savepoint
Available actions, for example, Delete

For more information about savepoints, see Understanding checkpoints and savepoints.

View application events

Log on to the SDP and navigate to Analytics > project-name. Click the Events link.

View all application event messages, including the type of and reason for each event, for example, Launch or Stop.

You can also view when the event was first and last seen, and the count of events.

View application deployment logs

Log on to the SDP and navigate to Analytics > project-name > app-name. Click the Logs link to view the deployment logs.

All logging by task managers is contained in the logs of the target Flink cluster.

For information about troubleshooting logs, see the section Application logging.

Users in the platform administrator role can access all of the system logs. For more information, see the section "Monitor application health" in the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.

View deployed applications in the Apache Flink UI

The SDP UI contains direct links to the Apache Flink UI.

Apache Flink provides its own monitoring interface, which is available from within the SDP UI. The Apache Flink UI shows details about the status of jobs and tasks within them. This interface is useful for verifying application health and troubleshooting.

To access the Apache Flink task managers view from the SDP UI, log on to the SDP and navigate to Analytics > Analytics Projects > project_name > flink_cluster_name.

View the Flink UI

You can view a deployed application in the SDP and Apache Flink UI.

Steps

1. Log on to the SDP.

2. Navigate to Analytics > Analytics Projects > project-name > app-name.

3. Click the Flink tab to open the job view in the Flink UI.

You can review:
Status of running task managers
Utilized and available task slots
Status of running and completed jobs


Working with Apache Spark Applications This chapter describes using Apache Spark™ applications with the Streaming Data Platform (SDP).

Topics:

About Apache Spark applications
Apache Spark terminology
Apache Spark life cycle
Understanding Spark clusters
Create a new Spark application
Create a new artifact
View status of Spark applications
View the properties of Spark applications
View the history of a Spark application
View the events for Spark applications
View the logs for Spark applications
Deploying Spark applications
Troubleshooting Spark applications

About Apache Spark applications The SDP 1.2 release introduces support for Apache Spark™. The Apache Spark integration allows users to manage applications through Kubernetes resources or within the SDP UI.

Apache Spark integration with SDP

Highlights include:

Support for Java, Scala, and Python (PySpark)
Spark analytic metrics within the SDP UI
Seamless integration with Apache Spark checkpoints
Parallel Readers and Writers supporting high-throughput and low-latency processing
Exactly-once processing guarantees for both Readers and Writers for micro-batch streaming

Developer community support

If you need help developing Spark applications with Pravega, contact the developer community on the Pravega Slack channel. To learn more about developing stream processing applications using Apache Spark:

See the Apache Spark open-source documentation at https://spark.apache.org.
See the SDP PySpark workshop examples at https://github.com/StreamingDataPlatform/workshop-samples/tree/master/spark-examples.
For information about developing Spark applications locally and testing with Pravega in a sandbox environment before deploying to SDP, see https://github.com/pravega/pravega-samples/blob/spark-connector-examples/spark-connector-examples/README.md.


Apache Spark terminology The following terms are useful in understanding Apache Spark applications.

Spark application An analytic application that uses the Apache Spark API to process one or more streams.

Spark job A running Spark application. A Spark job consists of many running tasks.

Spark task A Spark task is the basic unit of work. Each task is performed by a single thread.

RDD Resilient Distributed Dataset. The basic abstraction in Spark that represents an immutable, partitioned collection of elements that can be operated on in parallel.

Batch processing The input data is partitioned, and the data is queried on executors before writing the result. The job ends when the query has finished processing the input data.

Streaming micro-batch A Spark micro-batch reader allows Spark streaming applications to read Pravega streams and creates a micro-batch of events that are received during a particular time interval.

SDP supports micro-batch streaming that can use either exactly-once or at-least-once semantics.

Pravega stream cuts (offsets) are used to reliably recover from failures and provide exactly-once semantics.

A query processes the micro-batch, and the results are sent to an output sink. Spark starts processing the next micro-batch when the previous micro-batch ends. See the sketch after this terminology list for an illustration.

Streaming continuous SDP 1.2 and Pravega do not support continuous streaming with Apache Spark applications.

Block manager Responsible for tracking and managing blocks of data between executor nodes. Data is transferred directly between executors.

Spark SQL A layer that is built on standard Spark RDDs and DataFrames. Allows applications to specify SQL to perform queries. Known as Structured Streaming when used with Apache Spark streaming.
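To illustrate the micro-batch pattern described in this terminology list, the following is a minimal PySpark sketch that reads micro-batches from a Pravega stream and sends them to the console sink. It is an illustration only: the stream name (sensor-data) and the trigger interval are placeholder assumptions, the stream option name is assumed from the Pravega Spark connector, and the controller, scope, and allow_create_scope options and environment variables are the ones described later in Deploying Spark applications.

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pravega-microbatch-reader").getOrCreate()

# Connection details that SDP provides as environment variables
# (see "Deploying Spark applications" later in this chapter).
controller = os.environ["PRAVEGA_CONTROLLER_URI"]
scope = os.environ["PRAVEGA_SCOPE"]
checkpoint_dir = os.environ["CHECKPOINT_DIR"]

# Read micro-batches of events from a Pravega stream.
events = (spark.readStream
          .format("pravega")
          .option("controller", controller)
          .option("scope", scope)
          .option("stream", "sensor-data")        # assumed stream name
          .option("allow_create_scope", "false")  # SDP applications cannot create scopes
          .load())

# Each micro-batch is processed by the query and sent to an output sink;
# the next micro-batch starts when the previous one ends.
query = (events.writeStream
         .trigger(processingTime="10 seconds")
         .outputMode("append")
         .format("console")
         .option("checkpointLocation", checkpoint_dir)
         .start())

query.awaitTermination()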

Apache Spark life cycle The Spark application life cycle consists of goal and transient states.

Once a streaming data application is created, the Spark stream processing framework determines the state of the application and what action to take. For example, the system can either deploy the application or stop a running application. The following shows the life cycle of a SparkApplication.

The SparkApplication state field informs SDP of the intended state of the application. SDP attempts to move the application into that desired state based on its current state. Transient states are stages that the application goes through when attempting to reach a goal state.

Life cycle Description

Submitting Submitting is the first transient state. This state is repeated if the application fails.

Running If the state is running, SDP deploys the application.

Failed

Stopping Stopping is a transient state with the goal state of stopped.

Stopped If the state is stopped, the application is not deployed. SDP does nothing to the application.

Finished Finished is the goal state.

The goal state of finished is intended for batch applications. A user can resubmit an application in the finished state by first changing the state to stopped and then back to finished.


Understanding Spark clusters The Spark cluster is the application.

With Spark, there is no concept of a session cluster as there is with Flink. The Spark cluster is split into the driver and the executors. The driver runs the main application and manages the scheduling of tasks. The executors are workers that perform tasks. SparkContext is the cluster manager and is created as the first action in any Spark application. SparkContext communicates with Kubernetes to create executor pods. Application actions are split into tasks and scheduled on executors.

Create a new Spark application Perform the following steps to create a new Spark application using the SDP UI.

Steps

1. Log in to the SDP.

2. Go to Analytics > Analytics Projects > project-name > Spark > Create New Spark App. Click the Create New Spark App button.

Figure 10. Create new Spark app

The Spark app form displays.


Figure 11. Spark app form

In the form that is displayed, specify the following configuration information:

Field Description

General Provide the type, name for the application, and its state.

Source Source selections:
Maven: Use the Artifact drop-down to find the artifact version. If you do not see the version, click the Upload link to upload a new version of your code. An Upload Artifact form is displayed with choices for the Maven Group, Maven Artifact, Version, and a browse button for the JAR file. To change the version of the artifact, select a different version in the Artifacts drop-down menu.
File: Select the Main Application File from the drop-down menu. The selection can be a URL, image path, or other location. Click the Upload link to upload an Artifact File, using a path in the repository. Browse to choose any type of file to upload other than a JAR.
Custom Path: The Main Application File can be a URL, image path, or other location.
Main Class: Optional. This class is the program entry point and is only needed if it is not specified in the JAR file manifest.

Configuration
Runtime: Select the Spark version from the drop-down menu.
Executors: Choose the number of executors.
Executor Memory: The amount of memory, in GB.
Executor Cores: The number of executor cores.
Add Argument: A list of string arguments that is passed to the application.

Local Storage
Storage Type: Select Ephemeral or Volume Based.
Volume Size: The size, in GB.

Driver
Memory: The amount of memory, in GB.
Cores: The number of cores.
Storage Type: Select Ephemeral or Volume Based.
Volume Size: The size, in GB.

Working Directory
File Artifacts: Select the file artifacts from the drop-down menu.
Additional Custom Path: Custom path JAR dependencies to be added to the classpath.

Additional Jars For additional JARs to be added to the classpath:
Maven Jars: Select the Maven JAR artifacts from the drop-down menu.
File Jars: Select the JAR files from the drop-down menu.
Additional Custom Path: Custom path JAR dependencies to be added to the classpath.

3. Ensure that all the sections of the form have been completed as required and then click Create.

4. You are returned to the Analytics Project screen where you can view the status of the Spark application. Note: While the application is running, you cannot start or stop it; however, you have the option under the Actions column to Edit or Delete it.

Create a new artifact

About this task

Artifacts are added to an analytics project on the Create New Spark App form when you define an application in the SDP UI. You can upload Maven JAR files, individual application files, or custom paths to the SDP UI.

To manage Maven dependency artifacts, use one of the following two methods, as outlined in Steps #1 and #2.

Steps

1. Method A: Git clone the repository for spark-connectors and pravega-keycloak:

Run the following gradlew command to build the artifacts:

./gradlew install

and then upload the JAR file using the SDP UI as shown in Step #3.

2. Method B: Download the pre-built Spark connector for Pravega artifacts from Maven Central and/or JFrog:

pravega-connectors-spark-3.0
pravega-connectors-spark-2.4
JFrog (snapshots from jfrog.org)

and then upload the JAR file using the SDP UI as shown in Step #3.

3. Upload artifacts using the Create New Spark App form under the Source section. Choose either Maven (for JAR files) or File (for Python or Scala application files) as follows.

Maven: Use the Artifact drop-down to find the artifact version. If you do not see the version, click the Upload link to upload a new version of your code. An Upload Artifact form is displayed with choices for the Maven Group, Maven Artifact, Version, and a browse button for the JAR file. To change the version of the artifact, select a different version in the Artifacts drop-down menu. Click the Maven button.

Figure 12. Create Maven artifact

A screen similar to the following displays after uploading Maven artifacts:

Figure 13. Maven artifacts

File: Select the Main Application File from the drop-down menu. The selection can be a URL, image path, or other location. Click the Upload link to upload an Artifact File, using a path in the repository. Browse to choose an application file to upload.

Figure 14. Create application file artifact

The selected file is uploaded when the artifact is created. A screen similar to the following displays after uploading application file artifacts:


Figure 15. Application file artifacts

4. Change any configurations as required, such as the Spark version, dependencies, the application name in the deployment YAML file, and so on.

5. Run the following kubectl commands on the cluster to apply the changes and deploy the application:

kubectl apply -f generate_data_to_pravega.yaml -n namespace
kubectl apply -f pravega_to_console_python.yaml -n namespace

NOTE: You must be a federated user with LDAP as the identity provider to run Kubernetes commands on the cluster. For more information, see the "Configure federated user accounts" section of the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.

View status of Spark applications Select the Spark Apps tab to view the status of Spark applications.

Log in to the SDP and go to Analytics > project-name. For a Spark application, select Spark Apps. You can also select Spark Apps from the Dashboard. View the following:

Application name
Type
State of the application, for example, Submitting, Running, or Stopped
Goal state
Runtime
Executors
Available actions that you can perform on the application, for example, Start, Stop, Edit, and Delete


Figure 16. Spark App status

View the properties of Spark applications Select the Spark Apps tab to view the properties of Spark applications.

Log in to the SDP and go to Analytics > project-name. For a Spark application, under Analytics Projects select a Spark project.

NOTE: To reach the Spark UI, select the up arrow Spark link.

To view the Spark application properties:

1. Select Spark Apps. 2. Select an app. 3. Select the Properties tab.

Figure 17. Spark application properties


View the history of a Spark application Select the Spark Apps tab to view the history of a Spark application.

Log in to the SDP and go to Analytics > project-name. For a Spark application, under Analytics Projects select a Spark project.

NOTE: To reach the Spark UI, select the up arrow Spark link.

To view the Spark application history:

1. Select Spark Apps. 2. Select an app. 3. Select the History tab.

Figure 18. Spark application history

View the events for Spark applications Select the Spark Apps tab to view the events for Spark applications.

Log in to the SDP and go to Analytics > project-name. For a Spark application, under Analytics Projects select a Spark project.

NOTE: To reach the Spark UI, select the up arrow Spark link.

To view the Spark application events:

1. Select Spark Apps. 2. Select an app. 3. Select the Events tab.


Figure 19. Spark application events

View the logs for Spark applications Select the Spark Apps tab to view the logs for Spark applications.

Log in to the SDP and go to Analytics > project-name. For a Spark application, under Analytics Projects select a Spark project.

NOTE: To reach the Spark UI, select the up arrow Spark link.

To view the Spark application logs:

1. Select Spark Apps. 2. Select an app. 3. Select the Logs tab.

Figure 20. Spark application logs


Deploying Spark applications To run on SDP, Apache Spark applications must be configured using the options described in the table below. The corresponding environment variables are set appropriately for your environment in all Spark applications; however, it is up to your application to retrieve the environment variables and provide them as options to the Spark session.

Apache Spark option Environment variable Purpose

controller PRAVEGA_CONTROLLER_URI The Pravega controller URI for your SDP cluster. For example, tcp://nautilus-pravega-controller.nautilus-pravega.svc.cluster.local:9090

scope PRAVEGA_SCOPE The Pravega scope, which is equal to the Analytics Project name.

allow_create_scope N/A This must be set to false, since applications in SDP cannot create Pravega scopes.

checkpointLocation CHECKPOINT_DIR Stateful operations in Spark must periodically write checkpoints, which can be used to recover from failures. The checkpoint directory identified by this environment variable should be used for this purpose. It remains highly available until the Spark application is deleted. It should be used even for Spark applications that do not use Pravega.
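The following is a minimal, hedged PySpark sketch of this pattern: it retrieves the environment variables in the table above and supplies them as options to the Pravega connector, continuously writing a timestamp to a stream in the spirit of the stream_generated_data_to_pravega.py example referenced below. The stream name, the event column name, and the rate source settings are illustrative assumptions, not SDP requirements.

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-generated-data-to-pravega").getOrCreate()

# Values provided by SDP through environment variables (see the table above).
controller = os.environ["PRAVEGA_CONTROLLER_URI"]
scope = os.environ["PRAVEGA_SCOPE"]
checkpoint_dir = os.environ["CHECKPOINT_DIR"]

# Generate one timestamped row per second with Spark's built-in rate source
# and expose it as a single string column named "event" (assumed column name).
df = (spark.readStream
      .format("rate")
      .option("rowsPerSecond", 1)
      .load()
      .selectExpr("CAST(timestamp AS STRING) AS event"))

# Continuously write the events to a Pravega stream
# ("streamprocessing1" is a placeholder stream name).
query = (df.writeStream
         .format("pravega")
         .option("controller", controller)
         .option("scope", scope)
         .option("stream", "streamprocessing1")
         .option("allow_create_scope", "false")
         .option("checkpointLocation", checkpoint_dir)
         .start())

query.awaitTermination()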

Writing Apache Spark applications for deployment on SDP

Use the following example applications as a template when writing Apache Spark applications:

Python: stream_generated_data_to_pravega.py

Scala: StreamPravegaToConsole.scala

Uploading common artifacts to your analytics project

This task must be completed before running Apache Spark applications with SDP.

Steps

1. Download the Pravega Connectors for Spark 3.0 JAR from Maven Central.

2. Download the Pravega Connectors for Spark 2.4 JAR from Maven Central.

3. Download the Pravega Keycloak Client JAR from Maven Central.

4. Create an Analytics Project in SDP if you do not already have one. An Analytics Project can have any number of Flink and Spark applications.

NOTE: In the following steps, be careful to upload the correct file corresponding to the Maven artifact.

5. Upload the Pravega Connectors for Spark 3.0 JAR to SDP. a. Click Analytics > Artifacts > Maven > Upload Artifact. b. Enter the following:

Maven Group: io.pravega
Maven Artifact: pravega-connectors-spark-3-0-2-12
Version: 0.9.0
File: pravega-connectors-spark-3.0_2.12-0.9.0.jar


c. Click Upload.

6. Upload the Pravega Connectors for Spark 2.4 JAR. a. Click Analytics > Artifacts > Maven > Upload Artifact. b. Enter the following:

Maven Group: io.pravega
Maven Artifact: pravega-connectors-spark-2-4-2-11
Version: 0.9.0
File: pravega-connectors-spark-2.4_2.11-0.9.0.jar

c. Click Upload.

7. Upload the Pravega Keycloak Client JAR. a. Click Analytics > Artifacts > Maven > Upload Artifact. b. Enter the following:

Maven Group: io.pravega
Maven Artifact: pravega-keycloak-client
Version: 0.9.0
File: pravega-keycloak-client-0.9.0.jar

c. Click Upload.

8. You should now see the following artifacts in the SDP UI.

Deploying Python applications using the SDP UI

Use the steps in this section to deploy Apache Spark applications in the SDP UI.

Steps

1. Create your Python Spark application file. For example, stream_generated_data_to_pravega.py will continuously write a timestamp to a Pravega stream. Save this file on your local computer.

2. Upload your Python Spark application.

a. Click Analytics > Artifacts > Files > Upload File. b. Enter the following:

Path: (leave blank)
File: stream_generated_data_to_pravega.py

c. Click Upload.

3. Create the Spark application.

a. Click Spark > Create New Spark App. b. Enter the following:


General
Type: Python
Name: stream-generated-data-to-pravega

Source: File
Main Application File: stream_generated_data_to_pravega.py

Configuration
Runtime: Spark 3.0.1
Executors: 1

Additional Jars
Maven Jars:
io.pravega:pravega-connectors-spark-3-0-2-12:0.9.0
io.pravega:pravega-keycloak-client:0.9.0

c. Click Create.

4. You may view the application logs by clicking the Logs link.

NOTE: You can also run an application directly from a URL. Use the Source type of Custom Path, then enter a URL such as https://github.com/pravega/pravega-samples/blob/spark-connector-examples/spark-connector-examples/src/main/python/stream_generated_data_to_pravega.py

CAUTION: You should only run an application from a URL if you trust the owner of the URL.

Deploying Python applications using Kubectl

Much of the Spark application deployment process can be scripted with Kubectl.

Steps

1. Create your Python Spark application file. For example, stream_generated_data_to_pravega.py will continuously write a timestamp to a Pravega stream. Save this file on your local computer.

2. Use the SDP UI to create and upload your Python Spark application file; see the instructions in the previous section for details.

3. Create a YAML file that defines how the Spark application should be deployed. The following is an example YAML file that can be used to deploy stream_generated_data_to_pravega.py. Save this file as stream_generated_data_to_pravega.yaml on your local computer.

apiVersion: spark.nautilus.dellemc.com/v1beta1
kind: SparkApplication
metadata:
  name: stream-generated-data-to-pravega
spec:
  # Language type of application, Python, Scala, Java (optional: Defaults to Java if not specified)
  type: Python
  # Can be Running, Finished or Stopped (required)
  state: Running
  # Main Java/Scala class to execute (required for type: Scala or Java)
  # mainClass: org.apache.spark.examples.SparkPi
  # Signifies what the `mainApplicationFile` value represents.
  # Valid Values: "Maven", "File", "Text". Default: "Text"
  mainApplicationFileType: File
  # Main application file that will be passed to Spark Submit
  # (interpreted according to `mainApplicationFileType`)
  mainApplicationFile: "stream_generated_data_to_pravega.py"
  # Extra Maven dependencies to resolve from Maven Central (optional: Java/Scala ONLY)
  jars:
    - "{maven: io.pravega:pravega-connectors-spark-3-0-2-12:0.9.0}"
    - "{maven: io.pravega:pravega-keycloak-client:0.9.0}"
  parameters:
    - name: timeout
      value: "100000"
  # Single value arguments passed to application (optional)
  arguments: []
  # - "--recreate"
  # Directory in project storage which will be handed to Application as CHECKPOINT_DIR environment
  # variable for Application checkpoint files (optional, SDP will use Analytic Project storage)
  # checkpointPath: /spark-pi-checkpoint-d5722b45-8773-41db-93a4-bab2324d74d0
  # Number of seconds SDP will wait after signalling shutdown to an application before forcibly
  # killing the Driver POD (optional, default: 180 seconds)
  gracefulShutdownTimeout: 180
  # When to redeploy the application
  # (optional, default: OnFailure, failureRetries: 3, failureRetryInterval: 10, submissionRetries: 0)
  retryPolicy:
    # Valid values: Never, Always, OnFailure
    type: OnFailure
    # Number of retries before Application is marked as failed
    failureRetries: 5
    # Number of seconds between each retry during application failures
    failureRetryInterval: 60
  # Defines shape of cluster to execute application
  # Reference to a RuntimeImage containing the Spark Runtime to use for the cluster
  runtime: spark-3.0.1
  # Driver Resource settings
  driver:
    cores: 1
    memory: 512M
    # Kubernetes resources that should be applied to the driver POD (optional)
    resources:
      requests:
        cpu: 1
        memory: 1024Mi
      limits:
        cpu: 1
        memory: 1024Mi
  # Executor Resource Settings
  executor:
    replicas: 1
    cores: 1
    memory: 512M
    # Kubernetes resources that should be applied to each executor POD
    resources:
      requests:
        cpu: 1
        memory: 1024Mi
      limits:
        cpu: 1
        memory: 1024Mi
  # Extra key value/pairs to apply to configuration (optional)
  configuration:
    spark.reducer.maxSizeInFlight: 48m
    spark.driver.extraJavaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
  # Custom Log4J Logging Levels (optional)
  logging:
    org.myapp: DEBUG
    io.pravega: DEBUG
    io.pravega.connectors: DEBUG

4. Apply the YAML file with kubectl.

NAMESPACE=sparktest1
kubectl apply -n $NAMESPACE -f stream_generated_data_to_pravega.yaml

5. Monitor the Spark application.

kubectl get -n $NAMESPACE SparkApplication/stream-generated-data-to-pravega

Related information

Deploying Python applications using the SDP UI on page 58

Deploying Java or Scala applications using the SDP UI

Use the steps in this section to deploy Java or Scala applications in the SDP UI.

Steps

1. Prepare your Java/Scala build environment.

2. Compile your Java or Scala source code to produce a JAR file. The example Scala application StreamPravegaToConsole.scala will read strings from a Pravega stream and log them. This will build the file pravega-samples/spark-connector-examples/build/libs/pravega-spark-connector-examples-0.9.0.jar.

git clone https://github.com/pravega/pravega-samples
cd pravega-samples
git checkout spark-connector-examples
cd spark-connector-examples
../gradlew build

3. Upload your Spark application JAR.

a. Click Analytics > Artifacts > Maven > Upload Artifact. b. Enter the following. It is a best practice for the group, artifact, and version to match what is in the file, but this is not required.

Maven Group: io.pravega
Maven Artifact: pravega-spark-connector-examples
Version: 0.9.0
File: pravega-spark-connector-examples-0.9.0.jar

c. Click Upload.

4. Create the Spark application.

a. Click Spark > Create New Spark App. b. Enter the following:

General
Type: Scala
Name: pravega-to-console-scala

Source
Maven Artifact: io.pravega:pravega-spark-connector-examples:0.9.0
Main Class: io.pravega.example.spark.StreamPravegaToConsole

Configuration
Runtime: Spark 3.0.1
Executors: 1

Additional Jars
Maven Jars:
io.pravega:pravega-connectors-spark-3-0-2-12:0.9.0
io.pravega:pravega-keycloak-client:0.9.0

c. Click Create.

5. You may view the application logs by clicking the Logs link.

When running the above example, you will see output similar to the following every three seconds.

-------------------------------------------
Batch: 29
-------------------------------------------
+---------------------+--------+-----------------+----------+------+
|event                |scope   |stream           |segment_id|offset|
+---------------------+--------+-----------------+----------+------+
|2021-03-12 23:37:44.1|examples|streamprocessing1|0         |62582 |
|2021-03-12 23:37:46.1|examples|streamprocessing1|0         |62611 |
|2021-03-12 23:37:45.1|examples|streamprocessing1|0         |62640 |
+---------------------+--------+-----------------+----------+------+

Troubleshooting Spark applications This section includes common error messages when troubleshooting Spark applications.

Authentication failed

Symptom:

io.pravega.shaded.io.grpc.StatusRuntimeException: UNAUTHENTICATED: Authentication failed

Causes:
The Keycloak JAR is not included.
If this error occurs when attempting to create a scope, make sure the option allow_create_scope is set to false.


Working with Analytic Metrics

Topics:

About analytic metrics for projects
Enable analytic metrics for a project
View analytic metrics
Add custom metrics to an application
Analytic metrics predefined dashboard examples

About analytic metrics for projects SDP includes capabilities for automatic collection and visualization of detailed metrics for project applications. SDP generates separate dashboards for each Flink cluster, Spark application, and Pravega Search cluster defined within a project.

Description Analytic metrics is an optional feature within a project. At project creation, the administrator chooses whether to enable or disable project-level analytic metrics. The feature is enabled by default.

Project metrics are valuable for:

A historical view of long-running applications
Performance analysis
Insight into application conditions
Troubleshooting

The Metrics feature is enabled or disabled per project, at project creation time. When Metrics is enabled, SDP creates a metrics stack for the project. The stack contains:

Project-specific InfluxDB database
Project-specific Grafana deployment and access point
Separate Flink, Spark, and PSearch dashboards

SDP visualizes metrics in dashboards based on predefined templates. Separate dashboards are used for each Flink cluster, each Spark application, and the Pravega Search cluster created in the project. The default dashboards include the following:

Cluster type Dashboard contents

Flink clusters Shows metrics for Flink jobs, tasks, and task managers. When you delete the Flink cluster, the system deletes the cluster's analytic dashboard and metrics.

Spark applications Shows processing progress and other metrics. The Spark integration uses a Telegraf server to collect metrics. If you delete the Spark application, the system also deletes the application's analytic dashboard and metrics.

Pravega Search cluster Shows metrics for Pravega Search components, such as indexworkers and shardworkers. When you delete the Pravega Search cluster, the system deletes the cluster's dashboard and metrics.

Customizations Two types of customizations are supported:

Advanced Grafana users may create their own customized dashboards.
Application developers may include customized metrics in their applications. Applications can emit custom metrics to InfluxDB, and those metrics can be visualized by customized Grafana templates.

Metrics retention period

The retention period for metrics data is two weeks.

Dashboard access

Access to project dashboards and metrics is controlled by project membership.


Enable analytic metrics for a project The analytic metrics feature is enabled separately for each project, at project creation time.

About this task

SDP platform administrators create analytic projects. They can choose to enable or disable Metrics.

NOTE: It is not possible to enable metrics after project creation.

Steps

1. Log in to the SDP UI as an admin.

2. Create a project as described in the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.

3. In the Metrics field, ensure Enabled is selected.

Enabled is the default selection.

Results

If project creation is successful, the metrics stack is deployed. Metrics are collected for each cluster in the project.

View analytic metrics Project members can access a project's Grafana UI from the project page in the SDP UI. The Grafana UI lists all the project's dashboards. There is one dashboard for each Spark, Flink, and Pravega Search cluster in the project.

Prerequisites

At project creation time, the Metrics field must be set to Enabled.

About this task

When the Metrics feature is enabled in a project, SDP creates the metrics stack for the project. Metrics are generated for each Spark, Flink, or Pravega Search cluster in the project. These metrics are visualized on Grafana dashboards. Each Flink cluster, each Spark application, and the PSearch cluster (if it is defined) have their own dashboard.

To access the metrics dashboards for a Flink project, follow these steps:

Steps

1. Go to Analytics > project-name.

2. Click the Metrics link in the page header.


The link opens the Grafana UI.

3. In the tools pane on the left, choose Dashboards > Manage. A list of metrics dashboards appears in the main pane. One dashboard exists for each cluster defined in the project. There may be dashboards for multiple Flink clusters, multiple Spark applications, and one PSearch cluster.

4. Click the name of the dashboard of interest to display that dashboard.

Add custom metrics to an application Application developers may deploy custom metrics in applications by specifying InfluxDB and Grafana dashboard resources within their Helm chart.

Prerequisites

The Metrics option must be enabled at project creation time by the platform administrator. Users who are members of the project have access.

About this task

In the following Flink longevity metrics example, the developer would need to know what the secret is, and how to use it, in order to connect to the InfluxDB Database. The developer would add the secret and bind connection details to Environment variables within the application, which it then uses to set up the metrics connection.

The order in which the components are created is resolved by Kubernetes, since the POD cannot start until the secret, and thus the InfluxDB database, has been created.

The application would need to do the following:

Steps

1. Include an InfluxDBDatabase resource in the Helm chart.

apiVersion: metrics.dellemc.com/v1alpha1
kind: InfluxDBDatabase
metadata:
  name: flink-longevity-metrics
spec:
  influxDBRef:
    name: project-metrics

2. Include a GrafanaDashboard resource in the Helm chart.

apiVersion: metrics.dellemc.com/v1alpha1
kind: "GrafanaDashboard"
metadata:
  name: "flink-cluster"
spec:
  dashboard: |
    {
      ....
      "panels": [
        {
          "aliasColors": {},
          "bars": false,
          "dashLength": 10,
          "dashes": false,
          "datasource": "flink-longevity-metrics",
          ....
        },
      }

3. Configure the application parameters to pick up the details from the secret.

apiVersion: flink.nautilus.dellemc.com/v1beta1
kind: "FlinkApplication"
metadata:
  name: "flink-longevity-duplicate-check-0"
spec:
  ...
  parameters:
    - name: readerParallelism
      value: "1"
    - name: checkpointInterval
      value: "10"
    - name: metrics
      type: secret
      value: "flink-longevity-metrics"
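Inside the application itself, the connection details bound from the secret are then used to open the metrics connection and emit custom points. The following Python sketch is illustrative only: the environment variable names are hypothetical (they depend on how your Helm chart maps the secret into the application), and it uses the open-source influxdb client library rather than any SDP-specific API.

import os
from influxdb import InfluxDBClient  # open-source InfluxDB 1.x Python client

# Connection details bound from the project metrics secret.
# The variable names below are hypothetical; use whatever names your chart maps.
client = InfluxDBClient(
    host=os.environ["METRICS_INFLUXDB_HOST"],
    port=int(os.environ.get("METRICS_INFLUXDB_PORT", "8086")),
    username=os.environ["METRICS_INFLUXDB_USERNAME"],
    password=os.environ["METRICS_INFLUXDB_PASSWORD"],
    database=os.environ["METRICS_INFLUXDB_DATABASE"],
)

# Emit one custom metric point; a customized Grafana dashboard that points at
# the flink-longevity-metrics datasource can then chart this measurement.
client.write_points([
    {
        "measurement": "duplicate_check",
        "tags": {"application": "flink-longevity-duplicate-check-0"},
        "fields": {"duplicates_found": 0},
    }
])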

Analytic metrics predefined dashboard examples Dashboards provide preconfigured visualizations of an InfluxDB database. The following example shows predefined dashboard templates from analytic engine providers that are included with Streaming Data Platform.

Flink cluster metrics dashboard

The default Flink dashboard shows metrics for Flink jobs, tasks, and task managers. You can select the job of interest from the dropdown in the upper left. The following image shows only a few of the graphs.


Spark application metrics dashboard

The default Spark metrics dashboard lists jobs, stages, storage, environment, executors, and SQL.


Figure 21. Spark Jobs

Pravega Search cluster metrics dashboard

The Pravega Search metrics dashboard includes graphs for each of the service types (controller, resthead, indexworker, queryworker, shardworker). It also includes graphs for CPU and memory usage. The following image shows only a few of the graphs.


Working with Pravega Streams A Pravega stream is an unbounded stream of bytes or stream of events. The Pravega engine ingests the unbounded streaming data and coordinates permanent storage. Storage is configured by the platform administrator using the SDP UI. This section describes how to create, modify, and monitor Pravega streams.

Topics:

About Pravega streams
About stateful stream processing
About event-driven applications
About exactly once semantics
Understanding checkpoints and savepoints
About Pravega scopes
About Pravega streams
Create data streams
Stream configuration attributes
List streams (scope view)
View streams
Edit streams
Delete streams
Start and stop stream ingestion
Monitor stream ingestion
About Grafana dashboards to monitor Pravega

About Pravega streams Pravega writer RESTful APIs write streaming data to the Pravega stream store. Pravega is an open-source project sponsored by Dell EMC. Customized software connectors, such as the integrated Apache Flink and Apache Spark connectors, give stream processing applications access to Pravega.

Pravega handles all types of streams, including:

Unbounded streams
Bounded streams
Event-based messages
Batched logs
Streaming video
Any other type of stream

Pravega streams are based on an append-only log data structure. By using append-only logs, Pravega rapidly ingests data into durable storage.

Pravega handles unbounded streaming bytes and seamlessly coordinates a two-tiered storage system for each stream:

Pravega manages temporary storage for the recently ingested tail of a stream.
Pravega infinitely tiers ingested data into long-term storage, such as a Dell EMC PowerScale cluster or ECS object storage.

Before running applications that write or read streams to Pravega, the stream must be defined and configured in the SDP as described in the Create data streams section.

For more information about Pravega streams, see http://www.pravega.io.
For more information about long-term storage, see the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.


About stateful stream processing Stateful stream processing is an application design pattern for processing unbounded streams of events and is applicable to many different use cases. Any application that processes a stream of events and does not perform record-at-a-time transformations must be stateful with the ability to store and access intermediate data.

When an application receives an event, it can perform arbitrary computations that involve reading data from or writing data to the state. In principle, state can be stored and accessed in many different places including program variables, local files, or embedded or external databases.

Apache Flink or Spark stores the application state locally in memory or in an embedded database. Flink and Spark, as distributed systems, periodically write application state checkpoints to remote storage. The local state is protected against failures and data loss.

If a failure occurs, Flink and Spark can recover a stateful streaming application by restoring its state from a previous checkpoint and resetting the read position on the event log. The application replays and fast forwards the input events from the event log until it reaches the tail of the stream. This technique is used to recover from failures but can also be leveraged to update an application, fix issues, and repair previously emitted results, or migrate an application to a different cluster.
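As a concrete, hypothetical illustration (not taken from the SDP documentation), the following PySpark sketch runs a stateful windowed count: the aggregation result is the application state, and because the query writes checkpoints to a checkpoint location, restarting the same query with the same checkpoint location resumes from the last checkpoint instead of reprocessing from scratch. The rate source and the checkpoint path are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("stateful-count-example").getOrCreate()

# A stateful query: counting events per one-minute window requires Spark to
# keep intermediate state between micro-batches.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

counts = events.groupBy(window(events.timestamp, "1 minute")).count()

# The checkpoint location is what allows the application to recover its state
# and its read position after a failure or restart.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/stateful-count-checkpoint")  # placeholder path
         .start())

query.awaitTermination()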

About event-driven applications Event-driven applications are stateful streaming applications that ingest event streams and process the events with application- specific logic. Depending on the logic, an event-driven application can trigger actions, such as sending an alert or an email, or write events to an outgoing event stream to be consumed by another event-driven application.

Typical use cases for event-driven applications include:

Anomaly detection
Pattern detection or complex event processing
Real-time recommendations

Event-driven applications communicate using event logs instead of RESTful calls and hold application data as local state instead of writing it to and reading it from an external datastore, such as a relational database or key-value store.

Exactly-once state consistency and the ability to scale an application are fundamental requirements for event-driven applications.

About exactly once semantics Stream processing applications are prone to failures due to the distributed nature of deployment they must support. Tasks are expected to fail. The processing framework must provide guarantees that it maintains the consistency of the internal state of the stream processor despite crashes.

The SDP delivers end-to-end exactly once guarantees. Exactly once semantics mean that each incoming event affects the final result exactly once. If a machine or software fails, there is no duplicate data and no data that goes unprocessed. Exactly once semantics consumes less storage than traditional systems because duplication is avoided. There is one version of the data--one source of truth--that multiple analytic applications can access.

These exactly once guarantees are meaningful since you should always assume the possibility of network or machine failures that might result in data loss. Exactly once semantics means that the streaming application must observe a consistent state after recovery compared to the state before the crash.

Many traditional streaming data systems duplicate data for usage in different pipelines. With Pravega, data is ingested and protected in one place, with one single source of truth for all processing data pipelines.

A stream processing application can provide end-to-end exactly once guarantees only when the external data source (with which the source and sink operators interact) supports the following:

An event is not delivered more than once (duplicate events) to any source operator. For example, if there are multiple source operators (parallelism > 1) that are reading from the external data source, then each source operator is expected to see a unique set of events. The events are not duplicated across the source operators.
A source operator can consume data from the external data source starting at a specific offset.
The sink operator can request the external data source to commit or to roll back operations.


Pravega exactly once semantics

Streaming applications process large amounts of continuously arriving data. In Pravega, the stream is a storage primitive for continuous and unbounded data. Pravega ingests unbounded streaming data in real time and coordinates permanent storage.

Once the writes are acknowledged to the client, the Pravega model guarantees data durability, message ordering, and exactly once support. Duplicate event writes are not allowed. With message ordering, events within the same routing key are delivered to readers in the order in which they were written.

Pravega streams are based on an append-only log data structure. By using append-only logs, Pravega rapidly ingests data into durable storage. Pravega handles all types of streams, including:

Unbounded streams in real time (continuous flow of data where the end position is unknown)
Bounded streams (start and end boundaries are defined and well-known)
Event-based messages
Batched logs
Any other type of stream

Pravega seamlessly coordinates a two-tiered storage system for each stream.

Temporary storage for the recently ingested tail of a stream
Long-term storage on Dell EMC PowerScale or ECS

Users can configure data retention periods.

Applications, such as a Java program reading from an IoT sensor, write data to the tail of the stream. Analytics applications, such as Flink, Spark, or Hadoop jobs, can read from any point in the stream. Many applications can read and write the same stream in parallel. Pravega design highlights include elasticity, scalability, and support for large volumes of streamed data and applications.

The same API call accesses both real-time and historical data stored in Pravega. Applications can access data in near-real time, past and historical time, or at any arbitrary time in the future in a uniform fashion.

Pravega streams are durable, ordered, consistent, and transactional. Pravega is unique in its ability to handle unbounded streaming bytes. As a high-throughput store that preserves ordering of continuously streaming data, Pravega infinitely tiers ingested data into Dell EMC PowerScale long-term storage.

Apache Flink exactly once semantics

Specialized connectors enable applications to perform read and write operations over Pravega streams.

The Pravega Flink Connector is a data integration component that can read and write Pravega streams with the Apache Flink stream processing framework. The Apache Flink connector offers seamless integration with the Flink infrastructure, thereby ensuring parallel reads/writes, exposing Pravega streams as Table API records, allowing checkpointing, and guaranteeing exactly once processing with Pravega.

Apache Flink is a distributed stream processor with intuitive and expressive APIs to implement stateful stream processing applications. By combining the features of Apache Flink and Pravega, you can build a pipeline of multiple Flink applications. This pipeline can be chained together to give end-to-end exactly once guarantees across the chain of applications.

Flink provides a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. It performs computations at in-memory speed and at any scale. The order of data during processing is guaranteed. The Apache Flink distribution integrated with SDP adds enterprise-ready functionality to the open-source Apache Flink project, including:

Integration with container orchestration
Continuous integration/continuous delivery (CI/CD) job pipelines
Logging and metrics
State storage
Streaming ledger, which adds serializable ACID transaction capability to Flink

A checkpoint feature allows stopping a running application and restarting at the saved state. Flink supports exactly once guarantee with the use of distributed snapshots. With the distributed snapshot and state checkpointing method of achieving exactly once, all the states for each operator in the streaming application are periodically checkpointed. If a failure occurs anywhere in the system, all the states for every operator are rolled back to the most recently globally consistent checkpoint. During the rollback, all processing is paused. Sources are also reset to the correct offset corresponding to the most recent checkpoint. The whole streaming application is rewound to its most recent consistent state, and processing can then restart from that state.


Flink draws a consistent snapshot of all its operator states periodically (its checkpoint configuration intervals). Flink then stores the state information in a reliable distributed state backend storage. By doing so, it allows the Flink application to recover from any failures by restoring the application state to the latest checkpoint.

Flink includes APIs that can process continuous near-real-time data, sets of historic data, or combinations of both.

For additional information about Flink streaming analytics applications, see the section in this guide: Working with Apache Flink Applications. Also see Pravega Flink Connector.

Understanding checkpoints and savepoints A savepoint is a consistent image of the execution state of a streaming job, created using the Apache Flink checkpointing mechanism. You can use savepoints to stop-and-resume, fork, or update Flink jobs.

Savepoints consist of two parts: a directory with binary files on storage and a metadata file.

The files on storage represent the net data of the execution state image of the job.
The metadata file of a savepoint contains pointers to all files on storage that are part of the savepoint, in the form of absolute paths.

Conceptually, Flink savepoints are different from checkpoints in a similar way that backups are different from recovery logs in traditional database systems.

The primary purpose of checkpoints is to provide a recovery mechanism if there is an unexpected job failure. Flink manages the checkpoint life cycle. A checkpoint is created, owned, and released by Flink without user interaction.

In contrast, savepoints are created, owned, and deleted by the user.

Each time the application is stopped, its state is recorded in a Flink savepoint. Restarting the job deploys a new Flink job to the cluster, restoring the state from the savepoint.

NOTE: Since all clusters within an SDP analytics project have access to the savepoints, the applications can be restarted from any cluster.

About Pravega scopes A Pravega (project) scope is a collection of streams.

Pravega scopes are automatically created when someone in the platform administrator role creates an analytics project using the SDP. When a Pravega scope is created in SDP, a new Kubernetes service-instance of the same name is also created. The service-instance provides the means for access control to the scope and its streams through Keycloak security.

Accessing Pravega scopes and streams

Use these steps to list the Pravega scopes and streams for analytics projects that you have been granted access to as a project member in SDP.

Steps

1. Log on to the SDP. 2. Click the Pravega icon in the banner to view the project scopes and streams to which you have been granted access.

SDP administrators:

Refer to the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271 for information about configuring project scopes and adding project-members to a Pravega scope.

About cross project scope sharing

A Pravega scope can be shared in read-only mode with one or many other projects.

The Streaming Data Platform cross project scope sharing feature permits applications running in one project to have READ access to scopes from other projects. For example, a team working on ingesting data, such as stock prices, might be different from the analysts who analyze that same streaming data. Also, several distinct data analyst teams might want to consume that data but do not necessarily want to share their Flink or Spark applications or Maven artifacts. The cross project scope sharing feature allows these objects to remain hidden from other project members.

Platform administrators manage cross project scope sharing in the SDP UI, granting and removing project scope READ access to project members. When access is shared, applications in one project have READ access to streams in the shared scopes.

For more information about how administrators can configure cross project scope sharing, see the "Manage cross project scope sharing" section of the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/ docu96952_Streaming_Data_Platform_1.1_Installation_and_Administration_Guide.pdf.

Configuring applications to read a specific stream or set of streams from a shared scope

Reader applications that run in the context of a Pravega scope are run as a project-wide service account and can consume Pravega stream data in that scope. Applications must be configured to read a specific stream, or set of streams, from that shared scope.

First, a platform administrator grants shared access at the scope level. Then authorizations are created for the project member's applications to read the specified project scope and streams that are associated with that scope.

For example, the following Flink reader application is coded with:

The fully qualified stream name for the source (marketdata-0/stockprices-0)

The current project scope (compute-0) for the reader group storage/reader synchronization state

Figure 22. Flink connector reader app example code: Ingestion and reader apps shared between different projects

In the SDP UI, the compute-reader-0 single-reader application is shown reading data from the Pravega stream marketdata-0/stockprices-0.

Figure 23. Example in SDP: Ingestion and reader apps shared between different projects

Troubleshooting

If your application reader is failing with PERMISSION DENIED errors, have a platform administrator perform the following:

1. Check that the cross project binding is present and is not in a failed state.


a. Check the SDP UI to ensure that cross project access is fully granted.
b. Log in to SDP and go to the Pravega tab.
c. Click the scope that should be shared.
d. Click Manage cross project access to verify that the target project to be shared appears selected in the multiselect drop-down menu.
e. Log in to the console as an admin, and use the kubectl CLI to check that all service bindings in the system show Ready by running:

kubectl get servicebindings -n nautilus-pravega

2. Check the Keycloak admin console by going to the Evaluate tool under the Authorization tab of the Pravega-controller client.

About Pravega streams A Pravega stream is an unbounded stream of bytes or stream of events. Pravega writer RESTful APIs write the streaming data to the Pravega store.

Before applications can read or write streams to Pravega, the streams must be defined and configured in the SDP. Streams are most easily created in the SDP UI, but they can also be created manually.

A stream's full name consists of scope-name/stream-name. Therefore, a stream name must be unique within its scope and conform to Kubernetes naming conventions.

Users with the platform administrator role or project-members can create streams within existing scopes in the UI or manually using kubectl commands.

Users can select existing streams or define new ones for the project. Streams defined this way automatically appear in the list of streams under the project scope in the Pravega section of the UI.

Streams defined within a project-specific scope can only be referenced by applications in that project.

See the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271 for information on creating scopes, resource management, and administrative tasks to ensure that adequate storage and processing resources are available to handle stream volumes and analytics jobs.

Create data streams Both platform administrators and project-members can create data streams. All stream names must conform to Kubernetes naming conventions.

A stream name must be unique within the scope. All user-provided identifiers in the SDP must conform to the Kubernetes naming conventions as follows:

Names must consist of lowercase alphanumeric characters or a dash (-)
Names must start and end with an alphanumeric character

Create a stream from an existing Pravega scope

You can create a data stream from an existing scope. Platform administrators must first create the scope before project- members can create streams.

Steps

1. Log in to the SDP.

2. Click the Pravega icon in the banner.

A list of existing scopes appears.

3. Click a scope.

A list of existing streams in that scope appears.

4. Click Create Stream.


5. Complete the configuration screen as described in Stream configuration attributes.

Red boxes indicate errors.

6. When there are no red boxes, click Save. The new stream appears in the Pravega Streams table. The new entry includes the following available actions:

Edit: Use this action to change the stream configuration.
Delete: Use this action to remove the stream from the scope.

Stream access rules

Streams are protected resources. Access to streams is protected by the Pravega scope and project membership.

Stream access is protected as shown in the following table:

Access Description

Scope access If the scope was created in the Pravega section of the SDP UI, it has a list of project-members associated with it. This is a project-independent scope, and only the project-members and platform administrator users listed can access the streams in the scope.

Project membership If the scope was created under a project and an application in the Analytics section of the UI, it has project-members associated with it.

Only project-members and platform administrator users can add and view streams in the scope.


Stream configuration attributes

The following tables describe stream configuration attributes, including segment scaling attributes and retention policy.

General

Property Description

Name Identifies the stream. The name must be unique within the scope and conform to Kubernetes naming conventions. The stream's identity is:

scopename/streamname

Scope The scope field is preset based on the scope you selected on the previous screen and cannot be changed.

Segment Scaling

A stream is divided into segments for processing efficiency. Segment scaling controls the number of segments used to process a stream.

There are two scaling types:

Dynamic: With dynamic scaling, the system determines when to split and merge segments for optimal performance. Choose dynamic scaling if you expect the incoming data flow to vary significantly over time. This option lets the system automatically create additional segments when data flow increases and decrease the number of segments when the data flow slows down.

Static: With static scaling, the number of segments is always the configured value. Choose static scaling if you expect a uniform incoming data flow.

You can edit any of the segment scaling attributes at any time. Changes can take several minutes to affect segment processing, because scaling is based on recent averages over various time spans, with cool-down periods built in.

Dynamic scaling attributes:

Trigger: Choose one of the following as the trigger for scaling action:
● Incoming Data Rate: Looks at incoming bytes to determine when segments need splitting or merging.
● Incoming Event Rate: Looks at incoming events to determine when segments need splitting or merging.

Minimum number of segments: The minimum number of segments to maintain for the stream.

Segment Target Rate: Sets a target processing rate for each segment in the stream. When the incoming rate for a segment consistently exceeds the specified target, the segment is considered hot, and it is split into multiple segments. When the incoming rate for a segment is consistently lower than the specified target, the segment is considered cold, and it is merged with its neighbor. Specify the rate as an integer. The unit of measure is determined by the trigger choice:
● MB/sec when Trigger is Incoming Data Rate. Typical segment target rates are between 20 and 100 MB/sec. You can refine your target rate after performance testing.
● Events/sec when Trigger is Incoming Event Rate. Settings would depend on the size of your events, calculated with the MB/sec guidelines above in mind.
To figure out an optimal segment target rate (either MB/sec or events/sec), consider the needs of the Pravega writer and reader applications. For writers, you can start with a setting and watch latency metrics to make adjustments.


For readers, consider how fast an individual reader thread can process the events in a single stream. If individual readers are slow and you need many of them to work concurrently, you want enough segments so that each reader can own a segment. In this case, you need to lower the segment target rate, basing it on the reader rate, and not on the capability of Pravega. Be aware that the actual rate in a segment might exceed the target rate by 50% in the worst case.

Scaling Factor: Specifies how many colder segments to create when splitting a hot segment. The scaling factor should be 2 in nearly all cases. The only exception would be if the event rate can increase 4 times or more in 10 minutes. In that case, a scaling factor of 4 might work better. A value higher than 2 should only be entered after performance testing shows problems.

Static scaling attributes:

Number of segments: Sets the number of segments for the stream. The number of segments used for processing the stream does not change over time, unless you edit this attribute. The value can be increased and decreased at any time. We recommend starting with 1 segment and increasing only when the segment write rate is too high.

Retention Policy

The toggle button at the beginning of the Retention Policy section turns retention policy On or Off. It is Off by default.

Off (Default): The system retains stream data indefinitely.
On: The system discards data from the stream automatically, based on either time or size.

Retention Time (attribute: Days): The number of days to retain data. Stream data older than Days is discarded.
Retention Size (attribute: MBytes): The number of MBytes to retain. The remainder at the older end of the stream is discarded.
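For applications that create streams programmatically rather than through the UI, the same attributes map onto the open-source Pravega Java client. The following minimal sketch shows a dynamically scaled stream with a time-based retention policy; the controller URI, scope, and stream names are placeholders, and in SDP the client also needs the Pravega credentials issued for the project, which are omitted here.

import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.RetentionPolicy;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

import java.net.URI;
import java.time.Duration;

public class CreateStreamExample {
    public static void main(String[] args) {
        // Placeholder controller endpoint; use your SDP Pravega controller address.
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller.example.com:9090"))
                .build();

        try (StreamManager streamManager = StreamManager.create(clientConfig)) {
            StreamConfiguration config = StreamConfiguration.builder()
                    // Dynamic scaling by event rate: target 1000 events/sec per segment,
                    // scaling factor 2, minimum of 1 segment.
                    .scalingPolicy(ScalingPolicy.byEventRate(1000, 2, 1))
                    // Retention by time: discard data older than 7 days.
                    .retentionPolicy(RetentionPolicy.byTime(Duration.ofDays(7)))
                    .build();

            boolean created = streamManager.createStream("my-scope", "my-stream", config);
            System.out.println("Stream created: " + created);
        }
    }
}

A statically scaled stream would use ScalingPolicy.fixed(n) instead, and size-based retention would use RetentionPolicy.bySizeBytes(bytes).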

List streams (scope view)

List the Pravega streams for the project scopes you have access to.

Steps

1. Log on to the SDP and click the Pravega icon in the banner. A list of scopes to which you have access appears.

2. Click the name of a scope. A list of existing streams in that scope displays.


View streams

View Pravega stream writers and readers for the project scopes you have access to.

Steps

1. Log on to the SDP and click the Pravega icon in the banner. A list of scopes to which you have access appears.

2. Click the name of a scope. A view of existing writers/readers, transactions, and segment scaling displays.

With data flowing through, streaming data with no transactions looks similar to the following in the SDP UI (this is a truncated view).


Edit streams

Edit data streams using the SDP UI.

Steps

1. Log on to the SDP and click the Pravega icon in the banner. A list of scopes to which you have access appears.

2. Click the name of a scope. A list of existing streams in that scope appears.

3. Click the stream name.

4. On the Create Stream view, edit the configuration screen as detailed in the Stream configuration attributes section.

5. After editing the settings, wait until there are no red boxes and then click Save. The new stream displays and you have the following available actions:
● Edit: Change the stream configuration.
● Delete: Remove the stream from the scope.

6. Click the stream name to view the Pravega stream.

Delete streams

Delete a Pravega stream using the SDP UI.

Steps

1. Log in to the SDP and click the Pravega icon in the banner. A list of scopes to which you have access appears.

2. Click the name of a scope. A list of existing streams in that scope appears.

3. Click the stream name, and then click Delete under the Actions column.


Start and stop stream ingestion

Native Pravega, Apache Flink, or Apache Spark applications control stream ingestion.

The SDP creates and deletes scope and stream entities and monitors various aspects of streams. The system does not control stream ingestion.
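As a minimal illustration of application-controlled ingestion, the sketch below writes a single event with the open-source Pravega Java client. The controller URI, scope, stream, and event payload are placeholders, and an SDP application additionally needs the Pravega credentials for its project, which are omitted here.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;

import java.net.URI;

public class WriteEventExample {
    public static void main(String[] args) throws Exception {
        // Placeholder controller endpoint.
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller.example.com:9090"))
                .build();

        try (EventStreamClientFactory clientFactory =
                     EventStreamClientFactory.withScope("my-scope", clientConfig);
             EventStreamWriter<String> writer = clientFactory.createEventWriter(
                     "my-stream", new UTF8StringSerializer(), EventWriterConfig.builder().build())) {

            // The routing key groups related events into the same segment,
            // which preserves per-key ordering.
            writer.writeEvent("machine-ubuntu-1",
                    "{\"machine_id\": \"machine-ubuntu-1\", \"cpu_usage\": 90, \"mem_usage\": 80}")
                  .get();  // wait for the write to be durably acknowledged
        }
    }
}

Reading follows the same pattern with ReaderGroupManager and EventStreamReader.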

Monitor stream ingestion

Monitor performance of stream ingestion and storage statistics using the Pravega stream view in the SDP UI.

Steps

1. Log in to the SDP.

2. Go to Pravega > scope-name > stream-name.

View the following:

● Ingestion rates
● General stream parameter settings
● Segment heat charts showing segments that are hotter or colder than the current trigger rate for segment splits. For streams with a fixed scaling policy, the colors on the heat chart can indicate ingestion rates. The redder the segment is, the higher the ingestion rate.

About Grafana dashboards to monitor Pravega

Only users in the platform administrator role can access Grafana dashboards.

Stream metrics displayed on the Pravega dashboard include throughput, readers and writers per stream (byte and event write rates as well as byte read rates), and transactional metrics like commits and cancels. Reader and writer counts, which would be the application-specific metrics, are not available.

For more information about Grafana dashboards, Pravega, and administrator access, see the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.


Working with Pravega Schema Registry

Pravega schema registry stores and manages schemas for the unstructured data (raw bytes) stored in Pravega streams. It allows application developers to define standard schemas for events, share them across an organization, and evolve them in a safe way.

Topics:

● About Pravega schema registry
● Create schema group
● List and delete schemas
● Add schema to streams
● Edit schema group and add codecs

About Pravega schema registry

Pravega schema registry service is not limited to data stored in Pravega. It can serve as a general-purpose management solution for storing and evolving schemas in a wide variety of streaming and non-streaming use cases. Along with providing a storage layer for schemas, the schema registry service stores and manages encoding information in the form of codec information. Codecs can correspond to different compression or encryption that is used while encoding the serialized data at rest. The service generates unique identifiers for schemas and codec information pairs that users may use to tag their data.

Pravega schema registry stores schemas in a logical concept, known as a schema group. Application developers create the group in the Streaming Data Platform (SDP) with a set of compatibility policies, which store a versioned history of all schemas. Policies allow the evolution of schemas according to the configured compatibility settings.

The registry service also provides a RESTful interface to store and manage schemas under schema groups. Users can evolve their schemas within the context of the schema group based on schema compatibility policies that are configured at a group level.

The registry service supports serialization formats in Avro, Protobuf, and JSON schemas. Users can also store and manage schemas from any serialization system. The service allows users to specify wanted compatibility policies for evolution of their schemas, but these policies are employed only for the natively supported serialization systems.

The SDP installer automatically installs the schema registry Helm chart and service accounts. The security configurations are the same as are used in Pravega.

Schema registry in SDP

This guide provides an overview of the registry service and describes Pravega schema registry in the context of its usage within the SDP UI. Management of the registry service is not supported from within the SDP UI; therefore, Pravega schema management is not documented in this guide. For detailed information about schema management, see https://github.com/pravega/schema-registry/wiki/PDP-1:-Schema-Registry.

Application developers can use the registry service in multiple ways to fit their requirements. Developers can create and upload schemas that are associated with Pravega streams using the SDP UI into a cluster registry. Developers can configure applications to read shared schemas between multiple versions.

Schema management in the SDP is performed in the context of a Pravega stream and has the following broad phases:

1. Create schema group.
   a. Add schema group to an existing Pravega stream.
   b. Add schemas and codecs to the schema group.
2. List schemas.
   a. List schemas and codecs that are used for a Pravega stream.
3. Edit schema group.
   a. Update compatibility policy.
   b. Add schemas.
   c. Add codecs.
4. Delete schema group.

Schema registry overview

Schema registry can be used as a general-purpose service for centrally storing, serving, and managing schemas for a host of services. Each service instance is stateless, and all stream schemas and metadata are durably stored in Pravega Tables. Each service instance can process all queries.

NOTE: As of SDP 1.2, only Java applications can use schema registry.

As shown in the following figure, the schema registry is decoupled from Pravega. Pravega applications that are using both the registry service and schema storage link schema registry and Pravega. This linkage allows Pravega to be optimized as a streaming storage system for writing and reading raw bytes. In addition, the schema registry REST APIs ensure that its scope is not limited to Pravega use cases.

Schema groups

Conceptually, schema registry stores a versioned history of schemas that are scoped to a logical group. A schema group is a logical collection where schemas for a Pravega stream are stored. A group is the top-level resource that is supported by the registry and is essentially a name under which all related schemas are grouped.

Application developers can create a schema group in the SDP UI identified by a unique group name. Group names can include an optional qualifier called namespace. Each namespace and group combination uniquely identifies a resource in schema registry.

A schema group is analogous to a Pravega stream. Similarly, a namespace is analogous to a Pravega scope. Each stream, for which users intend to manage schemas, creates a corresponding group with the same name (groupname = stream name and namespace = scope name). All schemas defining the structures of data in the Pravega stream are stored and evolved under this group.

The schema registry service manages the sharing of schemas between components so that schemas that are used in Pravega streams can evolve over time. The schema serialization and compatibility framework supports schema evolution, allowing applications to use different schemas during writes and reads as follows:

Serialization binary formats (for how the data is transmitted and stored) include Avro, Protobuf, JSON schemas, and custom formats.

Serialization and deserialization are handled behind the scenes in the SDP.


Schema registry SDK libraries provide the serializers. An application instantiates the serializer. All activities of communicating with schema registry are hidden from users on both the READ and WRITE side. As events are deserialized, the structure is returned to the streaming application.

For detailed information about how schema registry works behind the scenes, see: https://github.com/pravega/schema-registry/wiki/PDP-1:-Schema-Registry
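As an illustration of the write-side flow, the sketch below builds a registry-aware Avro serializer with the open-source Pravega schema registry serializer library and passes it to a Pravega writer. The registry URI, controller URI, scope, stream, and the User class are placeholders, and exact class and package names can vary between schema registry releases, so treat this as an assumption to verify against the project documentation linked above.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.Serializer;
import io.pravega.schemaregistry.client.SchemaRegistryClientConfig;
import io.pravega.schemaregistry.serializer.avro.schemas.AvroSchema;
import io.pravega.schemaregistry.serializer.shared.impl.SerializerConfig;
import io.pravega.schemaregistry.serializers.SerializerFactory;

import java.net.URI;

public class RegistryAwareWriterExample {

    // Placeholder event type; with Avro reflection the class fields become the schema.
    public static class User {
        public String name;
        public int age;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder endpoints; in SDP these come from the project configuration.
        SchemaRegistryClientConfig registryConfig = SchemaRegistryClientConfig.builder()
                .schemaRegistryUri(URI.create("http://schema-registry.example.com:9092"))
                .build();

        // By SDP convention the group ID matches the stream name.
        SerializerConfig serializerConfig = SerializerConfig.builder()
                .groupId("my-stream")
                .registryConfig(registryConfig)
                .registerSchema(true)   // register the schema if the group does not have it yet
                .build();

        Serializer<User> serializer =
                SerializerFactory.avroSerializer(serializerConfig, AvroSchema.of(User.class));

        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller.example.com:9090"))
                .build();

        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("my-scope", clientConfig);
             EventStreamWriter<User> writer = factory.createEventWriter(
                     "my-stream", serializer, EventWriterConfig.builder().build())) {

            User user = new User();
            user.name = "alice";
            user.age = 30;
            writer.writeEvent(user).get();   // schema lookup/registration happens behind the scenes
        }
    }
}

Readers use the matching registry-aware deserializer from the same library, so the application never handles schema IDs directly.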

Configuration compatibility settings include AllowAny and DenyAll.

The schema registry service includes:

● Centralized registry: Allows reusable schemas between streaming data applications.
● Version management: Stores a versioned history and allows schema versions to evolve with the data over time. A version number identifies each schema. Schemas are backwards, forwards, and bidirectionally compatible. As schemas are evolved, they are captured in a monotonically increasing chronological version history.
● Schema validation: Ensures that downstream applications do not break due to incompatible data formats or encoding. The service divides schemas into logical validation groups and validates schemas against previous schemas that share the same type in the history of the group.
● Message encoding: The schema registry service generates unique identifiers for message encoding. Users can register different codecTypes under a group, and for each unique codecType and schema pair, the service generates unique encoding identifiers.

Support for multiple types

Schema registry allows for writing events of multiple types into the same Pravega stream. An example use case is event sourcing, where applications might try to record their events describing different actions in a specific order. Pravega streams provide strong consistency and ordering guarantees for the events that are written to Pravega.

The schema registry service does not impose a restriction on schemas within a group to be of a single type. Instead, Pravega applications can register schemas under the "type" and evolve schemas within this type.

The service divides schemas into logical validation groups and validates schema against previous schemas in the history of the group that share the same type. For example, if a stream had events of type UserAdded, UserUpdated, and UserRemoved, you register schemas under three different types within the same group. Each type evolves independently. The service assigns unique identifiers for each schema version in the group. The service also assigns type-specific version numbers for the order in which they were added to the group.

Schema terminology

Schema Schema describes the structure of user data.

Schema binary Schema binary is the binary representation of the schema. For Avro, Protobuf, and JSON, the service parses this binary representation and validates the schema to be well-formed and compatible with the compatibility policy. The schema binary can be up to 8 MB in size.

Codecs Schema registry includes support for registering codecs for a group. Codecs describe how data is encoded after serializing it. Typical usage of encoding would be to describe the compression type for the data. CodecType defines different types of encodings that can be used for encoding data while writing it to the stream. The schema registry serializer library has codec libraries for performing gzip and Snappy compression.

Schema event types

Schema registry allows multiple types of events to be managed under a single group. Any compatibility option allows you to use multiple data or event types within the same stream.

Each event or object type is a string that identifies the object that the schema represents. A group allows multiple types of schemas to be registered, evolves schemas of each type, and assigns them versions independently.

Schema compatibility

Schemas are considered compatible if data written using one schema can be read using another schema.

The registry defines rules for how the structure of data should be evolved while ensuring business continuity.


Compatibility policies determine how schemas are versioned within a group as they evolve. Compatibility types can be backward, forward, or bi-directional.

Backward compatibility allows you to use an older schema. Forward compatibility allows you to use a newer schema.

The service includes the options AllowAny and DenyAll. The AllowAny option allows any changes to schemas without performing any compatibility checks. The DenyAll option prevents new schema registration.

The compatibility policy choice is only applicable for formats that the service implicitly understands. So for formats other than Avro, JSON, and Protobuf, the service only allows the policy choice of AllowAny or DenyAll.

Schema encoding identifier

A schema encoding identifier is a unique identifier for a combination of schema version and codec type.

Users can request the registry service to generate a unique encoding ID (within the group) that identifies a schema version and codecType pair. The service generates a unique 4-byte identifier for the codecType and schema version.

The service indexes the encoding ID to the corresponding encoding information and also creates a reverse index of encoding information to the encoding ID.

Schema evolution Schema evolution defines changes to the structure of the data. Schema evolution allows using different schemas during writes and reads.

Schema object type

The type of object that the schema represents. For example, if a schema represents an object that is called "User," the type should be set to "User."

The service evolves all schemas for User within the type context User.

Schema group A logical collection where schemas for a Pravega stream are registered and stored. A schema group is a resource where all schemas for all event types for a stream are organized and evolved. Unlike Pravega streams, which are created under a namespace that is called a scope, an SDP Schema Registry group has a flat structure. A group ID identifies each group. For Pravega streams, this group ID is derived from the fully qualified scoped name of the Pravega stream. The group name is the same as the Pravega stream, and the namespace is the same as the Pravega scope or project-name.

Each Pravega stream can map to a group in a registry service; however, a group can be used outside the context of Pravega streams as well.

Schema group namespace

The namespace is analogous to the Pravega scope. Namespace is optional and can be provided at the time of group creation.

Schema group properties

Each group is set up with a configuration that includes the serialization format for the schemas and the wanted compatibility policy. Only schemas that satisfy the serialization format and compatibility policy criteria for a group can be registered to it.

Schema identifier When the service registers a schema, it is assigned a unique 4-byte ID to identify the schema within the group. If there is only a single type of object within the group, then the schema ID is identical to the version number. If there are multiple object types, the schema ID represents the same schema that is represented by "type + version number." The schema ID is unique for a group and is generated from a monotonically increasing counter in the group. If a group has "n" schemas, adding a new schema has the ID "n+1." If the same schema is added to multiple groups, it can have different IDs assigned in each group.

Schema information

Schemas that are registered with the service are encapsulated in an object called SchemaInfo that declares the serialization format, type, and schema binary for the schema. The serialization format describes the serialization system where this schema is to be used.

Schema metadata

Metadata that is associated with a named schema.

Schema metadata properties map

User-defined additional metadata properties that are assigned to a group as a String/String map. These values are opaque to the service, which stores and makes them available. Each key and value string can be up to 200 bytes in length.


Example group properties:

SerializationFormat: "Avro", Compatibility:"backward", allowMultipleTypes:"false", properties:<"environment":"dev", "used in":"pravega">

Schema version Each update to a schema is uniquely identified by a monotonically increasing version number. As schemas are evolved for a message type, they are captured as a chronological history of schemas.

Schema serialization format

Serialization format can be one of Avro, Protobuf, JSON, or a custom format. The schema registry service has parsers for Avro, Protobuf, and JSON schemas and validates the supplied schema binary to be well formed for these known formats. When you specify a custom format, you must specify a custom type name for the format. The service assigns the schema binary a unique ID, but it does not provide any schema validation or compatible evolution support for custom formats.

When the serialization format property is set to "Any," the group allows for multiple formats of schemas to be stored and managed in the same group.

Schema version info or version number

If a schema is registered successfully, the service generates a schema ID that identifies the schema within the group. It also assigns the schema a version number, which is a monotonically increasing number representing the version for the schema type.

As a new schema is registered, it is compared with previous schemas of the same type. The new schema is assigned a version number in the context of the type. For example, the first schema for object User is assigned version 0, and the nth schema for object User is assigned version n-1.

The schema ID, object type, and version number are encapsulated under a VersionInfo object for each registered schema.

Schema Registry REST APIs

Pravega schema registry exposes a REST API interface that allows developers to define standard schemas for their events and evolve them in a way that is backwards compatible and future proof. Schema registry stores a versioned history of all schemas and allows the evolution of schemas according to the configured compatibility settings. You can use the APIs to perform all the activities for the schema and groups if you have the prerequisite admin permissions.

There are two chief artifacts that Pravega applications interact with while using schema registry: the REST service endpoint and the registry-aware serializers. Any non-Pravega stream actions on schema registry can be performed using the schema registry's REST APIs directly, either programmatically or through the schema registry admin CLI.

Application development and programming are beyond the scope of this guide. For more information about streaming data application development, see the following repository:

● Pravega schema registry REST APIs: https://github.com/pravega/schema-registry/wiki/REST-documentation
● REST API usage examples: https://github.com/pravega/schema-registry/wiki/REST-API-Usage-Samples
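For a quick programmatic check against the REST endpoint, the following sketch lists the schema groups using only the JDK HTTP client. The registry base URL and bearer token are placeholders, and the /v1/groups path should be verified against the REST documentation linked above; in SDP, requests must carry credentials issued by Keycloak for the project.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ListGroupsExample {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and token; both are deployment-specific.
        String registryBase = "http://schema-registry.example.com:9092";
        String bearerToken = "<keycloak-access-token>";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(registryBase + "/v1/groups"))   // list schema groups
                .header("Authorization", "Bearer " + bearerToken)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());   // see the error codes below
        System.out.println(response.body());
    }
}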

Error codes

The following error codes may be returned to your running application:

500: Internal service error
Typically a transient error, such as intermittent network errors. Typical causes are connectivity issues with Pravega. Ensure that Pravega is running and that the schema registry service is running and is configured correctly. If the issue persists, check the registry service logs for WARN or ERROR level entries by running kubectl logs -n nautilus-pravega against the schema registry service pod.

400: Bad argument
Typically due to an ill-formed schema file.

401: Unauthorized

403: Forbidden


Create schema group

Project-members who have been granted access to a Pravega scope can create schema groups.

Steps

1. Log in to the SDP.

2. Go to Pravega > scope-name > stream-name and then click the Create Schema link. The Create Schema Group screen displays.

Where:

Section Description

General Group Name

Name of an existing Pravega scope or project-name.

Configuration Serialization Format

The serialization format can be one of Avro, Protobuf, JSON, Any, or a custom format. Schema registry service validates the supplied schema binary for these known formats.

Serialization supports schema evolution, which allows applications to use different schemas during writes and reads. When choosing a custom format, specify a custom type name for the format. The service assigns the schema binary a unique ID, but it does not provide any schema validation or compatible evolution support for custom formats.

When the serialization format property is set to "Any," the group allows for multiple formats of schemas to be stored and managed in the same group.

Compatibility

Compatibility policies for schema evolution include: Backward, Forward, and bi-directional, such as BackwardTransitive, ForwardTransitive, BackwardTill, ForwardTill, Full, FullTransitive, BackwardTillAndForwardTill, AllowAny, DenyAll.


The compatibility policy choice is only applicable for formats that the service implicitly understands. For formats other than schema group types of Avro, JSON, and Protobuf, the registry service only allows the policy choice of AllowAny or DenyAll.

The AllowAny option allows any changes to schemas without performing any compatibility checks.

The DenyAll option prevents new schema registration.

Allow Multiple Serialization Formats

Enable the Allow Multiple Serialization Formats option to support multiple serialization formats and allow for custom schema formats. This allows multiple schemas to co-exist within a group, for example, event sourcing support.

Properties Click Add Property to add user-defined metadata Key/Value pairs that are assigned to a group as a string/string map. Each key and value string can be up to 200 bytes in length. Example group properties:

SerializationFormat: "Avro", Compatibility:"backward", allowMultipleTypes:"false", properties:<"environment":"dev", "used in":"pravega">

3. Click Save to save the schema group configuration settings. Click Cancel to return to the streams list.
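Outside the UI, a schema group can also be created programmatically with the client from the open-source schema registry project. The endpoint, namespace, group name, and property values below are placeholders, and class and constructor shapes may differ between schema registry releases, so treat this as a sketch rather than a definitive API reference.

import io.pravega.schemaregistry.client.SchemaRegistryClient;
import io.pravega.schemaregistry.client.SchemaRegistryClientConfig;
import io.pravega.schemaregistry.client.SchemaRegistryClientFactory;
import io.pravega.schemaregistry.contract.data.Compatibility;
import io.pravega.schemaregistry.contract.data.GroupProperties;
import io.pravega.schemaregistry.contract.data.SerializationFormat;

import java.net.URI;

public class CreateSchemaGroupExample {
    public static void main(String[] args) {
        // Placeholder registry endpoint.
        SchemaRegistryClientConfig config = SchemaRegistryClientConfig.builder()
                .schemaRegistryUri(URI.create("http://schema-registry.example.com:9092"))
                .build();

        // namespace = Pravega scope, group = stream name, per the SDP convention.
        SchemaRegistryClient client = SchemaRegistryClientFactory.withNamespace("my-scope", config);

        // Avro serialization, backward compatibility, single event type per group.
        GroupProperties properties =
                new GroupProperties(SerializationFormat.Avro, Compatibility.backward(), false);

        client.addGroup("my-stream", properties);
    }
}

The same configuration choices (serialization format, compatibility policy, multiple types, and metadata properties) correspond to the UI fields described in the table above.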

List and delete schemas

Project members with Pravega scope access can list and delete existing schemas for that stream.

Steps

1. Log in to the SDP.

2. Go to Pravega > scope-name > stream-name and then click Edit Schema.

The Edit Schema Group screen displays with the names of schemas, serialization format, properties, creation date, and version.


3. You can also delete schemas from this screen by clicking the Delete link for the schema group under the Actions column. Follow the UI prompts to delete the schema.

Add schema to streams

Add schema to existing Pravega streams.

Steps

1. Log in to the SDP.

2. Go to Pravega > scope-name and then click the name of an existing Pravega stream to which you want to add a schema.

3. On the Streams view, click the Create Schema link to add schema to a stream.

4. The Create Schema Group screen displays with the following fields. Complete the configuration and then click Save.

Section Description

General Group Name

Name of an existing Pravega scope or project-name.

Configuration Serialization Format

The serialization format can be one of Avro, Protobuf, JSON, Any, or a custom format. Schema registry service validates the supplied schema binary for these known formats.

Serialization supports schema evolution, which allows applications to use different schemas during writes and reads. When choosing a custom format, specify a custom type name for the format. The service assigns the schema binary a unique ID, but it does not provide any schema validation or compatible evolution support for custom formats.

When the serialization format property is set to "Any," the group allows for multiple formats of schemas to be stored and managed in the same group.

Compatibility

Compatibility policies for schema evolution include: Backward, Forward, and bi-directional, such as BackwardTransitive, ForwardTransitive, BackwardTill, ForwardTill, Full, FullTransitive, BackwardTillAndForwardTill, AllowAny, DenyAll.

The compatibility policy choice is only applicable for formats that the service implicitly understands. For formats other than schema group types of Avro, JSON, and Protobuf, the registry service only allows the policy choice of AllowAny or DenyAll.

The AllowAny option allows any changes to schemas without performing any compatibility checks.

The DenyAll option prevents new schema registration.

Allow Multiple Serialization Formats

Enable the Allow Multiple Serialization Formats option to support multiple serialization formats and allow for custom schema formats. This allows multiple schemas to co-exist within a group, for example, event sourcing support.

Properties Click Add Property to add user-defined metadata Key/Value pairs that are assigned to a group as a string/string map. Each key and value string can be up to 200 bytes in length. Example group properties:

SerializationFormat: "Avro", Compatibility:"backward", allowMultipleTypes:"false", properties:<"environment":"dev", "used in":"pravega">

Edit schema group and add codecs

Project-members with Pravega scope access can edit schema groups and add codecs.

Steps

1. Log in to the SDP.

2. Go to Pravega > scope-name > stream-name and then click the Schemas tab. The Edit Schema Group screen displays.

3. Click the Add Schema button.

4. In the Add Schema panel, enter an optional schema name, serialization format, schema file, and additional metadata properties.


Where:

Configuration field Description

Add Schema Name Name of an existing Pravega scope or project-name

Serialization Format

The serialization format can be one of Avro, Protobuf, JSON, Any, or a custom format. Schema registry service validates the supplied schema binary for these known formats.

Serialization supports schema evolution, which allows applications to use different schemas during writes and reads. When choosing a custom format, specify a custom type name for the format. The service assigns the schema binary a unique ID, but it does not provide any schema validation or compatible evolution support for custom formats.

When the serialization format property is set to "Any," the group allows for multiple formats of schemas to be stored and managed in the same group.

Schema File Browse to an existing schema file on your local machine and then click Save. Simple example Avro schema file:

{ "type": "record", "name": "Type1", "fields": [ {"name": "a", "type": "string"}, {"name": "b", "type": "int"} ] }

You can view the schema by clicking the name link in the SDP UI after uploading the schema file.

Add Property Add user-defined metadata Key/Value pairs that are assigned to a group as a string/string map. Each key and value string can be up to 200 bytes in length. Example group properties:

SerializationFormat: "Avro", Compatibility:"backward", allowMultipleTypes:"false", properties:<"environment":"dev", "used in":"pravega">

5. Click Save when you are finished.

6. Select the Codecs tab on the Edit Schema Group screen, and then click the Add Codec button to register codecs for a schema group. The Add Codec panel displays.

7. Browse on your local machine to upload a codec library. SDP supports gzip or Snappy compression. Click Add Property to add user-defined metadata key/value pairs that are assigned to a group as a string/string map.

8. Click Save when you are finished.


Working with Pravega Search (PSearch)

Topics:

● Overview
● Using Pravega Search
● Health and maintenance
● Pravega Search REST API

Overview

Pravega Search (PSearch) provides search functionality against Pravega streams.

Pravega Search enables robust querying against the data stored in Pravega streams. Pravega Search supports two types of queries:
● Search queries: queries against all data stored in the stream and indexed.
● Continuous queries: real-time queries against data as it is ingested into the stream.

Pravega Search can query highly structured data and full text data.

Pravega Search understands the structure of the data in a stream based on mapping information you provide when you make a stream searchable. Using that structure definition, Pravega Search can process continuous queries and index all stored data in a stream. Indexing generates comprehensive Index documents, a type of metadata, that helps to efficiently locate data occurrences in response to search queries.

Both existing and new streams can be configured as searchable. On existing streams with a large volume of data, the initial indexing may take some time.

Relationship to Elasticsearch

NOTE: Elasticsearch, Logstash, and Kibana are trademarks of Elasticsearch BV, registered in the U.S. and in other countries.

Similar to Elasticsearch, Pravega Search builds a distributed search engine based on the Apache Lucene index. Pravega Search also provides a REST API similar to the Elasticsearch API, and works with Logstash and Kibana (OSS version) similar to the way that Elasticsearch works with them.

For querying, Pravega Search supports the following Elasticsearch query languages:

● DSL as defined by Elasticsearch
● SQL as defined by Elasticsearch

See https://www.elastic.co/ for information about Elasticsearch.

Integration into SDP

Pravega Search is available only as part of SDP. It is not available with Open Source Pravega. SDP provides all Pravega Search functionality in addition to comprehensive deployment, management, security, and access to the PSearch REST API.

Pravega Search is implemented in the context of an SDP project. It runs in its own cluster associated with a project. A project has only one PSearch cluster.

Access to a PSearch cluster is based on project membership. Only project members and SDP Administrators can make streams searchable, submit queries, and view query results. Applications using the Pravega Search REST API must obtain project credentials.

With Pravega Search deployed in its own separate cluster, stream indexing and query processing do not affect the efficiency of other stream processing functions. Each PSearch cluster has its own CPU and memory resources. The resources used for stream ingestion, stream storage, and analytic operations are not affected by the volume or timing of indexing and querying.

Pravega Search maintains index metadata in an infinitely expanding series of index documents, stored in system-level Pravega streams. SDP manages the index documents and all the indexing worker threads that perform the indexing and query processing.

Autoscaling for Pravega Search resources is built into SDP.

What's included with Pravega Search

Pravega Search licensing includes the following:

● Pravega Search functionality is enabled in the SDP UI.
● Pravega Search is deployed separately in each project, in project-specific PSearch clusters.
● The Pravega Search REST API is available. Each PSearch cluster has a unique REST head for access.
● Kibana is deployed in each project-specific PSearch cluster.
● A Pravega Search metrics stack (InfluxDB and Grafana dashboards) is deployed in each project-specific PSearch cluster.

Two query types

Users can submit search queries on the stored data that has finished indexing. They can register continuous queries on real-time data.

SDP supports two types of Pravega Search queries. Each query type solves a different type of use case.

Search queries Search queries search indexed data. In an optimized and correctly sized Streaming Data Platform, indexed data is available soon after ingestion. In addition to searching the stream, these queries can perform aggregations.

Search queries are submitted using REST API calls or using the embedded Kibana UI.

An example use case for a search query is to collect aggregated data over a time range, such as the average machine temperatures over the last ten minutes. A custom application or Kibana could monitor the averages and catch a developing abnormal condition.

Continuous queries

A continuous query is applied in real time against stream data as the data is ingested. Continuous queries can act as a filter, eliminating unwanted data, or perform data enrichment. Continuous queries can write matching, filtered, or enriched (tagged) data to output streams.

Project members can register continuous queries against a stream in the SDP UI or by submitting REST API calls. A query is registered once and runs continuously. The Pravega Search cluster scales up and down automatically based on real-time load.

Use cases for continuous queries include:
● Alerting, based on system logs, device data, or transactions
● Monitoring and enriching (tagging) original data
● Analytics and aggregations

An example use case for a continuous query is to search for an anomaly condition on machinery, such as exceeding a temperature threshold. A query that defines the threshold is applied against the incoming stream of data. An output stream collects the matches, and a custom application can analyze the output stream.

Additional Pravega Search functionality

Here are some additional Pravega Search features.

Seamless installation and upgrades

Pravega Search is a fully integrated component of SDP. The Pravega Search software is included by default in the SDP distribution and is installed with the initial SDP deployment. Licensing determines whether Pravega Search features are available for use.

SDP upgrade images include the latest supported Pravega Search and Pravega upgrades. Installers do not need to check component version compatibility when performing SDP upgrades.


Alerts and health Pravega Search produces operational metrics that are collected, aggregated, and visualized in the project-level analytic metrics provided by SDP. See About analytic metrics for projects on page 63 for usage information.

Pravega Search generates alerts based on predefined rules that are forwarded to KAHM and SRS, as described in the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.

Each Pravega Search pod maintains a service log. The logs are stored in the Kubernetes persistent volume store.

Autoscaling Pravega Search automatically scales its components. Scaling is infinite and automatic within the resources available in the Pravega Search cluster. The number of Pravega Search service pods is autoscaled up or down depending on current processing needs.

REST API The Pravega Search REST API is included in the SDP distribution. For more information, see Pravega Search REST API on page 116.

Limitations

The following limitations apply to Pravega Search in SDP 1.2:
● Automatic scaling in the Pravega Search cluster does not include the case in which the Pravega Search cluster runs out of resources. In that case, you must manually add more nodes to the Pravega Search cluster.
● The number of continuous queries that you can register on a stream is limited to 10 continuous queries per stream.
● In a continuous query result stream, order is not guaranteed by default. The order may not be preserved due to parallel processing. However, you can enable exactly-once mode over the searchable stream using the REST API.

Using Pravega Search

The following table summarizes Pravega Search usage steps.

Table 4. Pravega Search procedures summary

1. Create a Pravega Search cluster for the project. There is one Pravega Search cluster per project. Method: UI or REST API.
2. Configure searchable streams. You can configure new or existing streams as searchable. As part of this configuration, you provide the structure of the stream data, in JSON format. Method: UI or REST API.
3. Register continuous queries. Continuous queries operate on incoming data in real time. An output stream stores the results of continuous queries. Method: UI or REST API.
4. Submit search queries. Search queries operate on stored stream data and on incoming data as it finishes indexing. Submit queries and view, analyze, save, and share results in Kibana. Method: Kibana or REST API.
5. Make a stream not searchable any longer. The index documents for that stream are removed. Method: UI or REST API.

Getting started

Pravega Search is implemented within the context of an SDP project. You must be a project member or SDP administrator to use Pravega Search.

To get started with Pravega Search in the SDP UI, click Analytics > project-name > Search Cluster.

NOTE: If the Search Cluster tab is not visible on a project page, it means that Pravega Search is not licensed for your SDP implementation.


Figure 24. Location of Search Cluster tab in the UI

Create Pravega Search cluster

A Pravega Search cluster must be created in a project before you can index or search streams in the project.

Prerequisites

The project must exist and appear on the Analytics page in the UI. You must be a member of the project or an Administrator.

About this task

A Pravega Search cluster is a collection of processing nodes that are dedicated to Pravega Search indexing and searching functions. A project has only one Pravega Search cluster, which manages multiple searchable streams in the project and all querying on those streams.

The Pravega Search cluster is separate from the analytic clusters (for example, the Flink clusters) in the project. Adding and deleting a Pravega Search cluster does not affect analytic processing or stream data storage.

Steps

1. Go to Analytics > project-name.

2. Click Search Cluster.

NOTE: If the Search Cluster tab is missing, it indicates that Pravega Search is not licensed for this SDP deployment.

3. Click Create Cluster.

NOTE: If the Create Cluster button is missing, it indicates that a Pravega Search cluster is already deployed for the project.

4. On the Cluster Size page, select the size of the cluster you want to create.

General guidelines are:
● Small: For 1 to 10 searchable streams.
● Medium: For 10 to 50 searchable streams.
● Large: For 50 to 100 searchable streams.


Customized sizes are possible. Contact Dell Customer Support.

As you click each size, the UI shows estimated resource usage, based on estimated number of services and pods that would be deployed.

5. Ensure the size that you want is highlighted and click Save. The Search Cluster tab contains a Processing message in the State column. Deployment may take some time.

6. When the State column shows Ready, you may continue to Make a searchable stream on page 98.

Results

Adding a Pravega Search cluster also creates a system-level Pravega scope and many system-level streams in the scope. These streams store the index documents that Pravega Search creates as it indexes data.

NOTE: These system-level scopes and streams are hidden from project members in the SDP UI. They are visible to Administrators in the UI (on the Pravega page) and visible to all users on the Kubernetes command line. You should be aware of their existence and purpose. Do not remove them.

Resize or delete a Pravega Search cluster

You can resize the cluster at any time without affecting the contents of the cluster. You may delete the cluster only if there are no searchable streams.

Prerequisites

The delete action is not implemented if there are any searchable streams in the project. The action returns a message indicating that you must first remove the searchable configuration from all the streams in the project before you can delete the Pravega Search cluster.

Steps

1. Go to Analytic Projects > project-name > Search Cluster > Cluster.

2. To resize the cluster:

a. On the cluster row, click Edit.
b. On the Cluster Size page, select the new size. You may upsize or downsize the cluster.
c. Click Save.

3. To delete the cluster:

a. On the cluster row, click Delete.
b. If you receive a message indicating that there are searchable streams in the cluster, see Make searchable stream not searchable on page 101.

Make a searchable stream

You can make an existing stream searchable or create a new stream and make it searchable.

About this task

A stream must be searchable before you can search it. When you make a stream searchable, the index is configured and the indexing process is initiated.

If there is existing data in the stream, Pravega Search indexes all the existing data. This may take some time, depending on how much data is in the stream. When this initial index processing is finished, all data in the stream is available for search queries.

New data ingested into the stream is indexed after all existing data are finished indexing.

A searchable stream has one index definition mapped to it. All the data in the stream is indexed according to its associated index definition.

Steps

1. Go to Analytic Projects > project-name > Search Cluster > Searchable Streams.

2. Click .


The Create Searchable Stream page appears.

3. In the Stream dropdown, select a stream name from the list of project streams, or select Create New Stream. If you select Create New Stream, the UI places you on the project's stream creation page. Create the new stream. When you are finished, the UI returns you to the Create Searchable Stream page, with the new stream name selected in the dropdown list.

4. Select Basic or Advanced configuration to configure shards.

A shard is a processing unit that works on indexing data. Pravega Search autoscales shards and other processing.

Option Description

Basic Starts index processing with default basic settings. As processing load increases, Pravega Search automatically scales the number of shards.

Advanced Starts index processing with a configured number of shards.

Advanced mode is suggested for users who have experience with Elasticsearch and Lucene, or who have consulted Dell EMC support. Otherwise, choose Basic.

5. In the Identifier field, type a unique identifier for the searchable stream.

This value identifies the searchable stream when a user performs queries. It is similar to index name in ElasticSearch. Each searchable stream must have a unique identifier.

6. In the Index Mapping text box, provide a JSON object named Properties that describes the structure of the data in the stream.

You may copy the JSON from elsewhere and paste it into the text box.

See PSearch index mapping format on page 99 for the JSON object requirements and syntax.

7. Click Save.

NOTE: If the Save button is shaded, it means that the Index Mapping text box does not contain valid JSON. Correct the JSON to make the Save button active.

Results

When a stream is successfully made searchable, the following occurs:

● The searchable stream appears on the Searchable Streams tab on the Project page. To view the indexing configuration of the stream, click the stream name.
● Indexing of stream data starts. The indexing occurs in the background and is not visible in the UI.

PSearch index mapping format

An index mapping is required when you make a stream searchable. The index mapping defines the structure of the stream's incoming data.

Background information

Applications outside of SDP capture data as Java objects, JSON, or other formats. Pravega Writer applications serialize the object data into byte streams for efficient ingestion, and then Pravega deserializes the data into its original object format for storage. The index mapping that is required when a stream is made searchable defines the structure of the incoming data for PSearch. The required format for the index mapping is JSON.

Based on the index mapping that you provide, Pravega Search indexes the existing data in the stream and continues to index all newly arriving data. Pravega Search also uses the index mapping to process any continuous queries that are registered for the stream.

Index mapping format

The index mapping is a JSON object named properties:

{ "doc": {"properties": { "field-name-1": { "type": "data-type"

Working with Pravega Search (PSearch) 99

}, "field-name-2": { "type": "data-type" }, "field-name-3": { "type": "data-type" } } } }

Each data-type must match the definition of the data in the Pravega Writer to prevent errors. For valid values for type, see https://www.elastic.co/guide/en/elasticsearch/reference/7.7/mapping-types.html#mapping-types.

Index mapping example

Consider a constant stream of CPU and memory usage measurements collected from many machines. A Pravega Writer application writes the measurements as events, using the machine name as event-id. Here is the index mapping:

Example index mapping

{ "doc": { "properties": { "machine_id": { "type": "keyword" }, "cpu_usage": { "type": "long" }, "mem_usage": { "type": "long" } } } }

Here is a sample event as written into the stream:

Example event contents

{ "machine_id": "machine-ubuntu-1", "cpu_usage": 90, "mem_usage": 80 }

Using the mapping, Pravega Search indexes the incoming stream data and can subsequently provide results to search queries about cpu and memory usage. It could also capture a continuous query intended to tag CPU and memory usage readings above some threshold values.

View searchable stream configuration

You can view the searchable stream configuration properties.

About this task

You cannot change the configuration of the searchable stream. Rather, you must make the stream not searchable, and then make the stream searchable again with a new configuration.

Steps

1. Go to Analytic Projects > project-name > Search Cluster > Searchable Streams.

A list of the searchable streams in the project appears.

2. To view the searchable stream configuration, including the index mapping definition, click the stream name.

A popup window shows the current configuration.

You may copy the index mapping and other information from the popup window.


Make searchable stream not searchable

You can make a searchable stream not searchable.

About this task

NOTE: This action does not affect the stream data. It affects only the ability to index and search the stream.

Making a stream not searchable does the following:
● Deletes the index documents related to the stream. You can recreate index documents by making the stream searchable again.
● Removes the index mapping definition.
● Removes the stream name from registered continuous query configurations, if any.

Steps

1. Go to Analytics > project-name > Search Cluster > Searchable Streams.

2. Optionally save the index mapping definition for future use with these steps:

a. Click the stream name.
b. In the popup window that appears, copy the properties object and save it elsewhere.
c. Close the popup window.

3. Click Detach. The stream is detached from the Pravega Search cluster.

Continuous queries

A continuous query processes real-time data and provides results in real time. The results are stored in an output stream.

You can register a continuous query in the SDP UI or by using REST API calls.

A continuous query may be registered for one or multiple searchable streams in the project. With exactly-once and strict ordering semantic guaranteed, every incoming event in a stream is processed by each continuous query in strict order. Each event is processed by all the queries that are registered against the stream.

Each continuous query has a result stream. Queries may share result streams. The result stream indicates whether the event matches the query or not. Two main usage strategies for result streams are:

1. Filter: The result stream contains only those events that matched one of the continuous queries that are registered against the stream.
2. Tag: The result stream contains events that matched at least one continuous query. The output event can have multiple tags on it.
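For consuming a result stream from an application, the sketch below tails the output stream with the open-source Pravega Java client, assuming the output events are plain JSON text; if the source stream has a schema, use the registry-aware deserializer instead. The controller URI, scope, stream, and reader group names are placeholders, and SDP project credentials are omitted.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;

import java.net.URI;

public class ReadResultStreamExample {
    public static void main(String[] args) {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://pravega-controller.example.com:9090")) // placeholder
                .build();
        String scope = "my-scope";
        String resultStream = "cq-results";        // the continuous query's output stream
        String readerGroup = "cq-results-readers";

        try (ReaderGroupManager rgManager = ReaderGroupManager.withScope(scope, clientConfig)) {
            rgManager.createReaderGroup(readerGroup,
                    ReaderGroupConfig.builder().stream(Stream.of(scope, resultStream)).build());
        }

        try (EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, clientConfig);
             EventStreamReader<String> reader = factory.createReader(
                     "reader-1", readerGroup, new UTF8StringSerializer(), ReaderConfig.builder().build())) {

            // Read until the reader catches up with the tail of the stream.
            EventRead<String> event = reader.readNextEvent(2000);
            while (event.getEvent() != null || event.isCheckpoint()) {
                if (event.getEvent() != null) {
                    System.out.println(event.getEvent());   // matched (and possibly tagged) event
                }
                event = reader.readNextEvent(2000);
            }
        }
    }
}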

Pravega Search and the schema registry

Pravega Search is integrated with the Pravega schema registry.

Schema registry provides interfaces for developers to define standard schemas for their events, share them across the organization and evolve them in a safe way. The schema registry stores a versioned history of all schemas associated with streams and allows the evolution of schemas according to the configured compatibility settings.

Pravega Search uses the schema registry to advantage in the following ways:

● Pravega Search can support various serialization formats. The Pravega Search event reader can deserialize source stream events using the same deserializer that the event writer used when writing to the source stream.
● Pravega Search can support various data schemas. Pravega Search can get a deserialized object encapsulated with its schema information, so that the deserialized object can be indexed, queried, and written to output streams with its original schema, regardless of what that schema is.

In the SDP UI, developers define schema groups to store variations of schemas. A Pravega stream is optionally associated with a schema group. For Pravega Search, the continuous query output streams can also optionally be associated with schema groups. You can explicitly assign the output stream to a schema group or allow Pravega Search to make assumptions and use default settings. For details, see Continuous query output streams on page 103 and Example: Schema groups with continuous query output streams on page 106.


Evolving schemas

Changing the schema of the input stream is transparent to Pravega Search. The Pravega readers automatically adapt to read the new events with the new schema.

Evolving schemas also have no impact on the output stream. A schema group can contain multiple schemas. When a new schema is found, it is appended to the existing schema group for the output stream.

Register a continuous query

You can register a continuous query in the SDP UI. The query can operate against one or more streams in the project.

About this task

The continuous query page in the UI lets you define new streams and make existing streams searchable if you have not previously completed those steps.

A continuous query writes to a specified output stream. The continuous query page also lets you define a new stream for the output stream if you do not have one. The output stream does not have to be searchable. Multiple continuous queries can share the same output stream.

Steps

1. Go to Analytics > > Search Cluster > Continuous Queries.

2. Click Create Continuous Query.

3. In Name, type a unique name for this query.

4. In Query, type or paste the query statement.

See Continuous query construction on page 103 for information and examples.

5. In Searchable Streams, choose a stream from the drop-down list. You may also choose Create new stream. After choosing one stream, you may optionally choose additional streams. The list of streams that you select appears horizontally in the field.

When you choose Create new stream, the UI places you on the Searchable Streams page. You can make an existing stream searchable or create a new stream and make it searchable. The UI returns you to this continuous query page when you are done.

6. In Output Stream, choose an existing stream from the list or create a new stream using Create new stream.

If you choose Create new stream, the UI lets you define the new stream and then returns you to this page to continue defining the continuous query.

The output stream may be a searchable stream but it does not have to be. It can be a normal stream.

A continuous query writes results to only one output stream.

Multiple continuous queries can share the same output stream, with limitations. See Continuous query output streams on page 103.

7. In Output Stream Format, choose one of the following:

Tag: The output stream contains the events that matched the query. Events have a _tag field whose value is the names of all continuous queries that matched that event. This output format is typically used when multiple queries are associated with the same output stream and all results are written to the same output.

Filter: The output stream contains the events that matched the query, with no additional fields. Typical use cases for the filter option are: when there is only one query associated with one output stream, or when there are multiple filter queries on one searchable stream, with the results for each redirected to different output streams.

8. Click Save.

If Save is not enabled, there is an error on the input page. Check that all fields are completed and that the query format is valid.

Results

When Save is successful, the continuous query is active. If the input streams are ingesting data, the query is applied against that data.


Continuous query construction

Continuous queries are query statements that run against incoming data in real time.

The supported syntax for continuous queries in SDP v1.2 is Elasticsearch Query DSL version 7.7.0, as described at https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl.html.

Example

The following query is in DSL format. It matches events whose cpu value exceeds 95.

{ "query": { "range": { "cpu": { "gt": 95 } } } }

Continuous query output streams

The results of a continuous query are captured in the output stream that you specify when you register the query.

The output stream format depends on the tag or filter option chosen.

Tag: Pravega Search writes stream events that match a query to the output stream. The events are enriched with an additional field, _tag, which is an array of tags. Tag values are the names of all the continuous queries that match the event.

Events without any tags are not written to the output stream.

Filter: Pravega Search writes the stream events that match the continuous query to the output stream, with no additional fields.
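For illustration only, consider a hypothetical event with fields host and cpu that matches two registered continuous queries named high-cpu and node01-events. With the Tag output format, the event written to the output stream would look like this:

{
  "host": "node-01",
  "cpu": 97,
  "_tag": ["high-cpu", "node01-events"]
}

With the Filter output format, the same event would be written without the _tag field.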

Schema versus json output

Source stream has a schema and output is Filter

If the query's source stream has a schema associated with it, and the continuous query uses the Filter output option, Pravega Search writes the output as serialized data associated with the same schema as the source stream. This means that the source stream and output stream must be compatible, as follows:

If you associated a schema group with the output stream, then the properties in the schema group for the source stream must match the properties in the schema group for the output stream. The schema group properties are the serialization format and a compatibility setting.

If the output stream does not have an associated schema group, then Pravega Search assigns the schema group ANY to the output stream, with a compatibility of AllowAny.

Source stream does not have a schema or output option is Tag

If the query's source stream does not have a schema associated with it, or if the output stream uses the Tag option, Pravega Search writes the output events in JSON string format.

Limitations on sharing output streams

Multiple continuous queries can write to the same output stream, with the following limitations:

Multiple continuous queries with different output formats (FILTER and TAG) cannot share the same output stream. A continuous query that breaks this rule is rejected at creation or update time.

Multiple continuous queries that are registered against different source streams, with some source streams associated with schemas and some not, cannot share the same output stream. A continuous query that breaks this rule is rejected at creation or update time.


Edit a continuous query

You can edit a continuous query configuration.

About this task

You can change the following fields:

The query statement
The list of streams that the query operates on
The output stream

Steps

1. Go to Analytics > Search Cluster > Queries.

2. In the table of existing queries, locate the query that you want to change and click Edit.

3. Make the changes, and click Save.

Results

The changes take effect on newly ingested data shortly after the change is saved. The system requires only a short time to react to the change.

Deregister a continuous query

Deregister a continuous query to stop the query.

Steps

1. Go to Analytics > Search Cluster > Queries.

2. In the table of existing queries, locate the query that you want to remove and click Delete.

3. Confirm the deletion.

Results

The continuous query is removed from the system and cannot be recovered. The output stream is still available.

Example: Continuous query using the REST API

This example uses the REST API to create a continuous query that filters data as it is ingested.

Prerequisites

This example assumes the following:

A project is previously created.
A stream named demoStream is previously created in the project.
A Pravega Search cluster is previously created in the project.
The API call to authenticate to the cluster is previously submitted.

Steps

1. Declare a stream as searchable.

The following request declares demoStream as searchable. It creates an associated index named demoIndex. The incoming data is structured as one integer and one text field.

POST http://localhost:9098/_searchable_streams

{ "streamName": "demoStream", "indexConfig": { "name": "demoIndex",

104 Working with Pravega Search (PSearch)

"mappings": { "properties": { "key1": { "type": "integer" }, "key2": { "type": "text" } } } } }

2. Register a continuous query instance.

The query name is demoCQ. It examines incoming data for two conditions and writes events that satisfy both conditions to a stream named outputStream. Events that do not satisfy the conditions are not written to the output stream. All incoming data remains available in the original stream.

The query sets up a boolean AND condition: key1 must be greater than 100, and key2 must contain the term "python".

NOTE: To set up a boolean OR condition, use the "should" keyword.

POST http://localhost:9098/_cqueries

{ "name": "demoCQ", "query": { "bool": { "must": [ { "range": { "key1": { "gt": 100 } } }, { "term": { "key2": "python" } } ] } }, "searchableStream": ["demoStream"], "outputStream": { "name": "outputStream", "format": "filter" } }

3. Write test events to demoStream.

One way to write test events into a stream is in Kibana.

a. In the SDP UI, go to Analytics > PSearch cluster.
b. Click Kibana.
c. Using the API in Kibana, write event 1 to demoStream.

{ "key1": 110, "key2": "Tony loves python" }

d. Write event 2.

{ "key1": 90, "key2": "Tony loves python" }

e. Write event 3.

{ "key1": 110,

Working with Pravega Search (PSearch) 105

"key2": "Tony loves java" }

4. Look in demoStream.

In step 1, demoStream was declared searchable. Therefore, you can examine that stream in Kibana.

You should find all three events there.

5. Verify the contents of the output stream.

The output stream was not declared searchable, so you cannot examine it in Kibana. You can use a raw stream reader (see the sketch below), or log on to the SDP UI and make outputStream searchable.

Only the first event is in outputStream. The others were filtered out.

Event 1 matches both conditions. Event 2 does not match the greater-than-100 condition for key1. Event 3 does not match the contains-python condition for key2.
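As a sketch of the raw stream reader option mentioned above, the following standalone Java program prints the events in outputStream. It is illustrative only: it assumes a project scope named myproject, the in-cluster Pravega controller address shown elsewhere in this guide, and that the output events are plain JSON strings. SDP security settings (Keycloak credentials for the Pravega client) are omitted.

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.ReinitializationRequiredException;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;

import java.net.URI;

public class OutputStreamDump {

    public static void main(String[] args) {
        // Hypothetical values: use your own project scope; credentials configuration is omitted.
        String scope = "myproject";
        String stream = "outputStream";
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create(
                        "tcp://nautilus-pravega-controller.nautilus-pravega.svc.cluster.local:9090"))
                .build();

        // Create a reader group that covers the continuous query output stream.
        String readerGroup = "cq-output-readers";
        try (ReaderGroupManager rgManager = ReaderGroupManager.withScope(scope, clientConfig)) {
            rgManager.createReaderGroup(readerGroup,
                    ReaderGroupConfig.builder().stream(Stream.of(scope, stream)).build());
        }

        // Read the filtered events as JSON strings and print them.
        try (EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, clientConfig);
             EventStreamReader<String> reader = factory.createReader(
                     "reader-1", readerGroup, new UTF8StringSerializer(), ReaderConfig.builder().build())) {
            while (true) {
                try {
                    EventRead<String> event = reader.readNextEvent(2000);
                    if (event.getEvent() == null) {
                        break;      // no event arrived within the timeout
                    }
                    System.out.println(event.getEvent());
                } catch (ReinitializationRequiredException e) {
                    break;          // the reader group was reset; stop reading
                }
            }
        }
    }
}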

Example: Schema groups with continuous query output streams

This example shows how to use the schema registry with continuous query output streams.

About this task

Steps

1. Create a project named schema-registry-test.

2. Create a stream in the project scope named schemaRegistryTestStream.

3. Create a schema group for the stream.

Choose the following settings:

Serialization Format: Json
Compatibility: AllowAny

4. Make schemaRegistryTestStream a searchable stream.

Name the index the same as the stream name. The names are not required to match, but it is convenient if they do.

5. Create another stream that is a continuous query output stream.


Name this stream schemaRegistryTestOutputStream.

6. Create a schema group for the output stream, with the same settings that you used for the source stream.

NOTE: This step is optional. If you do not declare a schema group for the output stream, the schema of the source stream is used.

7. Create a continuous query that uses the input and output streams.

The query in this example matches all events:

{ "query": { "match_all":{} } }

8. Ingest data with schema.

Here is an excerpt from the Pravega writer.

SerializerConfig serializerConfig = SerializerConfig.builder()
        .registryClient(schemaRegistryClient)
        .groupId(schemaRegistryTestStream)
        .registerSchema(true)
        .build();
serializer = SerializerFactory.jsonSerializer(serializerConfig, JSONSchema.of(SchemaData.class));
EventStreamWriter<SchemaData> writer = clientFactory.createEventWriter(
        schemaRegistryTestStream, serializer, EventWriterConfig.builder().build());
writer.writeEvent(data);

9. Use Kibana to search data in the source stream and in the continuous query output stream.

The data is the same in both streams because the query was to match all.


Search queries

A search query searches all stored data in a stream that has finished indexing.

Each Pravega Search cluster deployment includes a Kibana instance. You can access the Pravega Search cluster's Kibana instance from the project page in the SDP UI. Then, use Kibana features to query the project's searchable streams, view the query results, analyze the results, and save the results.

Submit historical query and view results

This procedure describes how to go to the Kibana UI from the SDP UI and submit historical queries.

Prerequisites

Searchable streams must be defined in the cluster.
There must be some data in the searchable streams.
If a stream was recently made searchable, wait until indexing has finished running on the existing data. Otherwise, the query results are incomplete. In a stream with a large amount of data already stored, indexing can take some time.

NOTE: You can view indexing progress using the stream watermark API in Kibana.

Familiarity with Kibana is helpful. If needed, see the Kibana documentation at https://www.elastic.co/guide/.

Steps

1. In the SDP UI, go to Analytics > Search cluster.

2. Click the Kibana tab.

The Kibana UI opens in a new window.

3. Create an index pattern that represents the streams to include in your historical query.

a. In the left panel, click Management > Kibana > Index Patterns.
b. Click Create index pattern.
c. Start typing in the index pattern field.

Kibana lists all the index mapping names in the search cluster. The index names are the names that you used for the index mapping definitions when you created each searchable stream.

d. Define the streams to search as follows:

To search only one stream, select that stream's index mapping name.
Use * as a wildcard in an index naming pattern.
Use - to exclude a pattern.
List multiple patterns or names separated with a comma and no spaces.

For example: test*,-test3

e. If Kibana detects an index with a timestamp, it asks you to choose a field to filter your data by time. A time field is required to use the time filter.

4. Submit queries.

a. In the left pane, click Discover.
b. Select the index name pattern.

The index name pattern defines the streams to include in the search.

c. Optionally set a time filter.

The time filter defines the time range of data to include in the search.

d. Create the query.

See the Kibana documentation for supported query languages and syntax.

e. Examine the results, filter them, add and remove fields, save the search, and share the search.

See the Kibana documentation for instructions.

5. Create visualizations and dashboards.

a. In the left pane, click Dashboard.
b. See the Kibana documentation for instructions to create, save, and share dashboards.


6. To return to the SDP UI, click the SDP tab in your browser.

The SDP and Kibana UIs are simultaneously available in separate tabs in your browser.

Example: Search query using Kibana

This example uses Kibana for search queries.

Prerequisites

This example assumes the following:

A project is previously created.
A stream that is named taxi-demo is previously created in the project and has been ingesting data over time.

The data in the stream looks like this:

{ "tolls_amount":"0", "pickup_datetime":"2015-01-07 14:58:07", "trip_distance":"1.58", "improvement_surcharge":"0.3", "mta_tax":"0.5", "rate_code_id":"1", "dropoff_datetime":"2015-01-07 15:10:17", "vendor_id":"2", "tip_amount":"1", "dropoff_location":[ -73.97346496582031, 40.78977584838867 ], "passenger_count":2, "pickup_location":[ -73.9819564819336, 40.76926040649414 ], "store_and_fwd_flag":"N", "extra":"0", "total_amount":"11.3", "fare_amount":"9.5", "payment_type":"1", "event_time":1607330906767 }

Steps

1. Create the Pravega Search cluster if it does not yet exist in the project.

a. In the SDP UI, go to Analytics.
b. Click Search Cluster.
c. If a cluster name appears on the Cluster sub-tab, the search cluster already exists and you can go to the next step. Otherwise, click Create Search Cluster.
d. Choose a size, and click Save.
e. Wait until the Cluster state is Running.

2. Declare the stream searchable.

a. Click the Searchable Streams sub-tab.
b. Choose taxi-demo from the dropdown list of streams.
c. Click Create Searchable Stream in the upper right.
d. Name the index so that you recognize it in Kibana, or accept the default name that shows in the field.
e. Provide the index mapping. The following index mapping matches the data in the example above.

{ "properties":{ "surcharge":{ "type":"float" }, "dropoff_datetime":{

Working with Pravega Search (PSearch) 109

"type":"date", "format":"yyyy-MM-dd HH:mm:ss" }, "trip_type":{ "type":"keyword" }, "mta_tax":{ "type":"float" }, "rate_code_id":{ "type":"keyword" }, "passenger_count":{ "type":"integer" }, "pickup_datetime":{ "type":"date", "format":"yyyy-MM-dd HH:mm:ss" }, "tolls_amount":{ "type":"float" }, "tip_amount":{ "type":"float" }, "payment_type":{ "type":"keyword" }, "extra":{ "type":"float" }, "vendor_id":{ "type":"keyword" }, "store_and_fwd_flag":{ "type":"keyword" }, "improvement_surcharge":{ "type":"float" }, "fare_amount":{ "type":"float" }, "ehail_fee":{ "type":"float" }, "cab_color":{ "type":"keyword" }, "dropoff_location":{ "type":"geo_point" }, "vendor_name":{ "type":"text" }, "total_amount":{ "type":"float" }, "trip_distance":{ "type":"float" }, "pickup_location":{ "type":"geo_point" }, "event_time":{ "format":"yyyy-MM-dd HH:mm:ss || epoch_millis", "type":"date" } } }

f. Typically choose Basic configuration. Knowledgeable users can click Advanced and specify the initial shard number.


g. Click Save.

The stream appears in the list of Searchable Streams.

3. Click Kibana.

4. On the Kibana UI, in the index pattern field, select the index name that you specified when making the stream searchable.

Although it is possible to specify name patterns or create a list of index names, this example uses only one index name.

5. Create Kibana charts and dashboards.

See the Kibana documentation for details.

Health and maintenance

You can monitor the Pravega Search cluster to ensure health and resource availability.

Monitor Pravega Search cluster metrics

If project metrics are enabled, the Pravega Search cluster has a metrics stack.

See About analytic metrics for projects on page 63 for information about collecting and viewing metrics.

See Edit the autoscaling policy on page 112 for a description of how the Pravega Search automatic scaling uses the collected metrics to trigger scaling actions.

Delete index documents

As indexes grow, you may want to delete index documents to reclaim space.

Use the Pravega Search REST API to delete unwanted index documents.

Scale the Pravega Search cluster

Most Pravega Search scaling occurs automatically based on configured values.

Services that consume resources

A Pravega Search cluster has three main functions: indexing incoming stream data, processing continuous queries, and responding to search queries. The services in a Pravega Search cluster that perform those functions and consume resources are summarized here.

Table 5. Pravega Search cluster services

Controller: Manages the cluster and the indexes.

index worker: Reads events from a Pravega source stream, processes continuous queries, and sends events to shard workers.

resthead: Receives query requests and forwards them to either query workers (if they exist) or shard workers.

query worker: An optional component. Manages query requests and responses for heavy query workloads. When query workers do not exist, queries go directly from the resthead to shard workers.

shard worker: Accepts processing from other components. Shard workers consume most of the CPU and memory resources in the cluster.
From index workers: Shard workers index the events and save the Lucene indices into Pravega bytestreams. Each shard worker has zero or more shards, each of which is a Lucene directory.
Querying: Shard workers process search queries forwarded to them from query workers or the resthead.


Indications that scaling is needed

Workload affects resource consumption in the following ways:

Data ingestion rate: As the stream ingestion rate increases, the number of stream segments in a stream increases, leading to more Pravega readers in the index workers. The increased number of readers requires more CPU and memory, which creates the need for more index workers. Finally, the heavier workload on the index workers requires more shard workers.

Number of search requests: As the search workload increases, the load on the resthead, query workers, and shard workers increases.

For projects that have Metrics enabled, you can monitor metrics about the components, the workload, and the resources. Monitoring stream segment numbers, CPU and memory usage of index workers, shard numbers, and indexes can point out trends and help predict scaling needs. When CPU and memory resources are approaching their limits, scaling is required.

Types of scaling

Horizontal pod scaling: Changes the number of pod replicas.
Automatic: Scaling is automatic, based on current metrics and configured values.
Manual: Scaling is controlled by manually setting explicit values that override calculations.

Vertical pod scaling: Changes the CPU and memory resources available to pod containers.
Manual: Add resources to the configuration.

Cluster scaling: Scales the Kubernetes cluster at the node level.
See "Expansion and Scaling" in the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271.

Edit the autoscaling policy

Typically, users do not need to edit the autoscaling policy.

About this task

Autoscaling is performed by the psearch-scaler cronjob. This job runs every 5 minutes. It automatically increases or decreases the number of service replicas based on current metrics values and scaling policies. You may adjust the scaling policies to customize your automatic scaling results.

Changing the autoscaling policy is useful in some specific scenarios such as:

CPU and memory usage is high, but the number of pod replicas has reached the upper bound. If the cluster has sufficient resources, you can increase the maxScaleFactor. This change allows autoscaling to create more replicas.

A busy node is running a shardworker that is handling more than one large shard. The busy node is running low on available CPU and memory. Without intervention, the shardworker may be stopped with out-of-memory errors. If other nodes have extra resources, an adjustment to the autoscaling policy can help. In this case, change the shardworker configuration, increasing maxScaleFactor and decreasing shard_num. These changes result in more shardworkers and better distribution of the large shards, so that the workload is distributed among all the nodes.

Performance for indexing or continuous queries needs improvement, but CPU and memory usage have not reached the scaling thresholds. In this case, decrease the cpu_m and memory_M thresholds in the indexworker configuration. This change causes more aggressive scaling up.

Steps

1. Edit the psearch-scaling-policy configmap.

kubectl edit cm psearch-scaling-policy -n $project

2. Change settings for the various services according to need.

Common edits are:

Enlarge maxScaleFactor to increase the number of replicas that autoscaling can create.


Reduce the values in any of the targetAverageValue settings to increase the optimal number of replicas, making it easier to satisfy the scale-up condition.

Change minScaleFactor and maxScaleFactor to 1 to disable autoscaling.

minScaleFactor: A factor used to calculate the minReplicas value for autoscaling.

maxScaleFactor: A factor used to calculate the maxReplicas value for autoscaling. Guidelines for the various services are:
An efficient number of controllers is related to the total workload across the cluster. The controller manages the cluster.
An efficient number of indexworkers is related to the number of stream segments being processed per indexworker.
An efficient number of shardworkers is related to the average number of shards per shardworker and sometimes the size of shards. Shardworkers are CPU and memory intensive. Too many shards running on one shardworker lowers performance.

shard_num: The shardworker service includes an additional field named shard_num. For an index with very large shards, one or two shards can consume a few gigabytes of memory. In that case, you can decrease the value of shard_num to make the shardworker scale up more easily. For small indexes, 10 shards per shardworker is reasonable.

metrics[...]: Metrics fields determine the current optimal number of replicas. Current values for the listed metrics and their target values are used to calculate an optimal number of replicas. When multiple metrics are listed, autoscaling uses the largest calculated result as the optimal number of replicas. By reducing the values in any of the targetAverageValue settings, you make it easier to satisfy the scale-up condition.
The shardworker has an additional metric, shard_num. In an index with very large shards, one or two shards can consume a few gigabytes of memory. In this case, decreasing the targetAverageValue for shard_num triggers scaling up more easily. For small indexes, a targetAverageValue of 10 shards per shardworker is reasonable.

minReplicas: The current minimum number of replicas for autoscaling purposes. This value is calculated as minScaleFactor times the initial number of replicas.

maxReplicas: The current maximum number of replicas for autoscaling purposes. This value is calculated as maxScaleFactor times the initial number of replicas.
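For example, if the initial number of shardworker replicas is 2, then a minScaleFactor of 1 and a maxScaleFactor of 4 allow autoscaling to keep the shardworker replica count between 2 (minReplicas) and 8 (maxReplicas). The numbers are illustrative only.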

The configmap contains a block of these settings for each Pravega Search service.


Disable auto-scaling

You can disable horizontal autoscaling.

Steps

1. Edit the psearch-scaling-policy configmap.

kubectl edit cm psearch-scaling-policy -n $project

2. Change minScaleFactor and maxScaleFactor to 1 for every service.

Change to manual horizontal scaling

Manual scaling uses the explicit values that you set for the number of replicas and scales each service to the specified number.

About this task

Typical scenarios when manual scaling is useful are:

If the workload has a large amount of incoming data but few query requests, you might improve performance by decreasing the replicas for the resthead and queryworker services and increasing the replicas for indexworkers and shardworkers.

If you want to create a smaller or larger cluster.

Steps

1. Edit the PravegaSearch resource.

kubectl edit pravegasearch psearch -n $project

2. Edit the values in the replica labels. Set the specific number of replicas to create for each service.

3. Under spec, add source: manual.

Example

Here is a view of the PravegaSearch resource:
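The following illustrative excerpt shows only the fields that these steps touch, in the JSON form used by the REST examples later in this chapter (kubectl edit presents the same fields as YAML). The replica values are examples only:

{
  "spec": {
    "source": "manual",
    "indexWorker": { "replicas": 4 },
    "shardWorker": { "replicas": 4 },
    "queryWorker": { "replicas": 2 },
    "resthead": { "replicas": 2 }
  }
}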


Vertical pod scaling

Vertical pod scaling adds cpu and memory to existing pods.

About this task

Use vertical pod scaling in these scenarios:

A service is running out of memory.
Large shards are running in shard workers.

For performance reasons, some index data are cached in the shard worker. The shard worker might run out of memory if very large shards are running on it. In that case, you can increase the required memory and java heap size.

NOTE: This procedure does not affect existing statefulsets or pods. When you upgrade to a new Pravega Search version, statefulsets and pods are re-created using the new configurations.

Steps

1. Edit the PravegaSearch custom resource.

a. Run:

kubectl edit pravegasearch psearch -n $project

b. Change some or all of the following values for indexworker or shardworker, as needed:

Under resources, change cpu and memory requests.

Under jvmOpts, change the Java heap size (the first value).

2. For existing pods, edit the configmaps of the components you want to change.

For example, edit the psearch-shardworker-configmap:

kubectl edit cm psearch-shardworker-configmap -n $project

You can enlarge the max heap value (-Xmx) or add other jvm options if needed.


3. Edit the statefulset of the component by increasing the resource request values (cpu or memory). The pod restarts after you save the settings.

kubectl edit sts psearch-shardworker -n $project

You can increase cpu or memory as follows:
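For example, the resource requests appear under the container spec of the statefulset, in the standard Kubernetes layout. The container name and values below are placeholders only:

spec:
  template:
    spec:
      containers:
      - name: shardworker
        resources:
          requests:
            cpu: "2"
            memory: 6Gi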

Pravega Search REST API

The Pravega Search REST API is available for use in applications that are deployed in an SDP project.

REST head

The root URL of the Pravega Search REST API is project-specific.

The Pravega Search API is not open for external connectivity. An application that uses the Pravega Search REST API must be deployed within SDP, in the same project that contains the streams that are being searched.

Each project has a Pravega Search API root URL, as follows:

http://psearch-resthead-headless.{namespace}.svc.cluster.local:9098/

Where:

{namespace} is the project name.

For example:

http://psearch-resthead-headless.myproject.svc.cluster.local:9098/
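For example, an application pod in the myproject namespace could call the cluster state endpoint that is used in the authentication examples later in this chapter (the Authorization header is described under Authentication, and the credential value is a placeholder):

curl --location --request GET \
    'http://psearch-resthead-headless.myproject.svc.cluster.local:9098/_cluster/state' \
    --header 'Authorization: Basic (basic_credentials)'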

Authentication

Introduction

To expose the Pravega Search REST API, REST authentication is required in the Pravega Search RestHead. All components that access the Pravega Search REST API must add authentication to the request header. A REST ingress is created for the RestHead service, so users do not need to port-forward the RestHead.

When the RestHead starts, the interceptor for Pravega Search REST authentication is registered. All REST requests are sent to the interceptor first. Two types of authentication are supported.


Basic authentication

A username and password are used for authentication. Every Pravega Search cluster has a username and password that are generated dynamically and randomly. If the username and password in the request header match the username and password in the environment variables created by the PSearch-Operator, access is granted.

Bearer authentication

An RPT token that is generated by Keycloak is used for authentication. Users must get the RPT token from Keycloak and add it to the request header. The interceptor parses the request header to get the RPT token and checks the access role associated with the token. The role must be either admin or the appropriate project member role; otherwise, a 401 error response is returned.

How to get the ingress hosts for Pravega Search REST access

Use the following command:

kubectl get ingress psearch-resthead-headless -n <projectName>

How to get the credentials for Basic authentication

To get the username and password:

kubectl exec -it psearch-resthead-0 -n <projectName> env | grep REST_AUTH

To get the value for basic_credentials:

echo -n "username:password" | base64

How to get the credentials for Bearer authentication

To get the RPT_Token:

sh ./pravega-search/tools/keycloakToken/token.sh \ -p (projectName) -c (clientSecret) -i (keycloak ingress)

To get the clientSecret, use the Keycloak UI or the secret file from the Kubernetes pod.

To get the Keycloak ingress:

kubectl get ingress -A | grep keycloak

Example: How to get access to the Pravega Search REST API with authentication

The following example shows how to get access to the Pravega Search REST API with Basic authentication:

curl --location --request GET 'http://(ingress_hosts)/_cluster/state' \
    --header 'Authorization: Basic (basic_credentials)'

The following example shows how to get access to the Pravega Search REST API with Bearer authentication:

curl --location --request GET 'http://(ingress_hosts)/_cluster/state' \
    --header 'Authorization: Bearer (RPT_Token)'

Component Impact

The following components require access to the Pravega Search REST API:

Nautilus UI for Pravega Search: Bearer
Kibana: Bearer and Basic


For Basic authentication, Kibana gets username and password from the rest-auth-secret secret in the project namespace.

When users open the Kibana UI, the request is first redirected to perform Bearer authentication with the RPT token. After that, when users access the Pravega Search REST API from the Kibana dashboard, the username and password that were created by the PSearch-Operator are added to all request headers. This behavior is configured in the Kibana initialization script.

Resources

The API provides access to Pravega Search clusters, searchable streams, and continuous queries.

Pravega Search cluster

Pravega Search indexing and querying runs in a Pravega Search cluster. The Pravega Search cluster is in a project namespace. Each project can have only one Pravega Search cluster, which handles indexing and querying operations for all the streams in the project.

The Pravega Search REST API supports the following actions for Pravega Search clusters:

Create a Pravega Search cluster for a project.
Change a Pravega Search cluster configuration. PUT requires the entire resource; PATCH requires only the fields that you want to change.
Delete a Pravega Search cluster.
List the Pravega Search cluster in a project namespace.
List all Pravega Search clusters in all namespaces.
Get information about a Pravega Search cluster.

Searchable stream

A searchable stream is any Pravega stream that is marked as searchable by Pravega Search. When a stream is made searchable, it has an index mapping associated with it that defines the stream data structure. Pravega Search uses the index mapping to index the stream's existing data and all new data as the data is ingested. You can register continuous queries and submit historical queries against a searchable stream.

The Pravega Search REST API supports the following actions for searchable streams:

Declare a stream searchable.
Get information about a searchable stream.
List searchable streams.
Remove the searchable property from a stream.

Continuous query

A continuous query operates on one or more streams in real time as the data is ingested. It writes results to an output stream.

The Pravega Search REST API supports the following actions for continuous queries:

Create a continuous query, registering it to operate on one or more streams.
Get information about a continuous query.
Update a continuous query definition.
Delete a continuous query.
Get a list of continuous queries.

Search query

Search queries operate on a stream after it is indexed. The search query (historical query) resource lets a user perform a query in Kibana.

Common object descriptions

The following table describes the objects that appear in many of the Pravega Search REST API requests.

{namespace}: A Kubernetes namespace specific to a project. A project namespace is the same as the project name.

{psearch-name}: The name that was assigned to the Pravega Search cluster when the cluster was created. Because there is only one cluster in a project, the name is fixed as psearch.

{streamname}: The name of a stream.

{queryname}: The name of a saved continuous query.


Methods documentation

The Pravega Search REST API documentation is available for download as a zip file.

Download the file. Extract all files, and open the index.html file.

The documentation above does not include the CREATE PSearch cluster API. See the next section for that method.

Cluster resource operations

The following table summarizes cluster resource operations.

Create a cluster:
POST {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/

Body: See https://eos2git.cec.lab.emc.com/NAUT/psearch-operator/tree/master/charts/psearch for a body template. Many configurations are supported when you create the cluster with the REST API. Note that the SDP UI offers only a few of those configurations.

Delete a cluster:
DELETE {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

List all clusters:
GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches

Retrieve information about a specific cluster:
GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Edit a cluster using PATCH:
PATCH {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Header: content-type:application/merge-patch+json

Body: { "spec": { }}

Edit a cluster using PUT:
PUT {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Body: Use the response payload from GET, editing fields as needed.

Create a cluster

Create a PSearch cluster in a project namespace. A project namespace may have only one PSearch cluster.

Request

POST {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/

For a template, see https://eos2git.cec.lab.emc.com/NAUT/psearch-operator/tree/master/charts/psearch.

The REST API is more flexible than the SDP UI for cluster configurations. The limited selections available in the SDP UI accommodate the most typical customer requirements.

Object descriptions

See the template for object descriptions.

Example request

The following example request creates a small cluster.

POST {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/

{ "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "kind": "PravegaSearch", "metadata": { "name": "psearch"

"namespace":"project-1" }, "spec": { "source": "manual", "imageRef": "psearch-image", "indexWorker": { "jvmOpts": [ "-Xmx2g", "-XX:+PrintGC", "-XX:+PrintGCDetails", "-Xloggc:log/gc.log" ], "replicas": 4, "resources": { "requests": { "cpu": "500m", "memory": "1Gi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" }


} }, "kibana": { "replicas": 1 }, "options": { "psearch.metricsEnableStatistics": "true", "psearch.metricsEnableStatsDReporter": "false", "psearch.metricsInfluxDBUri": "http://psearch-influxdb:8086", "psearch.metricsInfluxDBUserName": "admin" }, "pSearchController": { "jvmOpts": [ "-Xmx2g", "-XX:+PrintGC", "-XX:+PrintGCDetails", "-Xloggc:log/gc.log" ], "replicas": 3, "resources": { "requests": { "cpu": "100m", "memory": "500Mi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "queryWorker": { "jvmOpts": [ "-Xmx2g", "-XX:+PrintGC", "-XX:+PrintGCDetails", "-Xloggc:log/gc.log" ], "replicas": 2, "resources": { "requests": { "cpu": "100m", "memory": "500Mi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "resthead": { "jvmOpts": [ "-Xmx2g", "-XX:+PrintGC",


"-XX:+PrintGCDetails", "-Xloggc:log/gc.log" ], "replicas": 2, "resources": { "requests": { "cpu": "100m", "memory": "500Mi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "shardWorker": { "jvmOpts": [ "-Xmx3g", "-XX:MaxDirectMemorySize=3g", "-XX:+PrintGC", "-XX:+PrintGCDetails", "-Xloggc:log/gc.log" ], "replicas": 2, "resources": { "requests": { "cpu": "1", "memory": "3Gi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } } }

}


Example responses

Description Response body

Created

HTTP Status: 201 { "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "kind": "PravegaSearch",



"metadata": { "creationTimestamp": "2020-03-04T03:52:59Z", "generation": 1, "name": "psearch", "namespace": "ccc", "resourceVersion": "4654248", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/ccc/ pravegasearches/psearch", "uid": "08100ca4-541c-41c8-a03e-0e6bfd414d3d" }, "spec": { "externalAccess": { "enabled": false, "type": "ClusterIP" }, "imageRef": "cluster-psearch-image", "indexWorker": { "jvmOpts": [ "-Xmx2g" ], "replicas": 2, "resources": { "requests": { "memory": "2Gi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "kibana": { "replicas": 1 }, "options": { "psearch.leaderElectionType": "fast", "psearch.luceneDirectoryType": "pravega_bytestream", "psearch.metricsEnableStatistics": "false", "psearch.metricsInfluxDBName": "psearch", "psearch.metricsInfluxDBPassword": "password", "psearch.metricsInfluxDBUri": "http://pravega-search- influxdb.default.svc.cluster.local:8086", "psearch.metricsInfluxDBUserName": "admin", "psearch.perfCountEnabled": "true" }, "pSearchController": { "jvmOpts": [ "-Xmx2g" ], "replicas": 2, "resources": { "requests": { "memory": "2Gi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ],



"dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "pravegaControllerAddress": "tcp://nautilus-pravega- controller.nautilus-pravega.svc.cluster.local:9090", "queryWorker": { "replicas": 2, "resources": {}, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "resthead": { "replicas": 2, "resources": {}, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "securityEnabled": true, "shardWorker": { "jvmOpts": [ "-Xmx3g", "-XX:MaxDirectMemorySize=3g" ], "replicas": 2, "resources": { "requests": { "memory": "3330Mi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" }



}, "storageClassName": "standard" } } } } }

Already Exists

HTTP Status: 409 { "kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Failure", "message": "pravegasearches.search.nautilus.dellemc.com \"psearch\" already exists", "reason": "AlreadyExists", "details": { "name": "psearch", "group": "search.nautilus.dellemc.com", "kind": "pravegasearches" }, "code": 409 }

Delete a cluster

Delete an existing cluster from a project namespace.

Request

DELETE {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Example request

The following example request deletes the PSearch cluster named psearch.

DELETE {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/project-1/pravegasearches/psearch

Example responses

Description Response body

Deleted

HTTP Status: 200 { "kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Success", "details": { "name": "psearch", "group": "search.nautilus.dellemc.com", "kind": "pravegasearches", "uid": "08100ca4-541c-41c8-a03e-0e6bfd414d3d" } }



Not Found

HTTP Status: 404 { "kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Failure", "message": "pravegasearches.search.nautilus.dellemc.com \"psearch\" not found", "reason": "NotFound", "details": { "name": "psearch", "group": "search.nautilus.dellemc.com", "kind": "pravegasearches" }, "code": 404 }

List clusters

List PSearch clusters and their status.

Request

List PSearch clusters in a namespace. There is always only one PSearch cluster per namespace.

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches

List all PSearch clusters in all namespaces.

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/pravegasearches

Example for one project namespace

The following request lists the PSearch cluster and its status for the bbb project.

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches

The response format is:

{ "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "items": [ { "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "kind": "PravegaSearch", "metadata": { "creationTimestamp": "2020-03-03T05:28:58Z", "generation": 3, "name": "psearch", "namespace": "bbb", "resourceVersion": "4338841", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches/ psearch", "uid": "7748428b-6c45-486d-8649-94f61f9c4545" }, "spec": {

...

},


"status": { "ingress": { "url": "" }, "ready": true, "reason": "" } } ], "kind": "PravegaSearchList", "metadata": { "continue": "", "resourceVersion": "4687091", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches/" } }

Example for all projects

The following request lists all PSearch clusters and their status in all namespaces.

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/pravegasearches

The response format is:

{ "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "items": [ { "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "kind": "PravegaSearch", "metadata": { "creationTimestamp": "2020-03-03T05:28:58Z", "generation": 3, "name": "psearch", "namespace": "bbb", "resourceVersion": "4338841", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches/ psearch", "uid": "7748428b-6c45-486d-8649-94f61f9c4545" }, "spec": {

...

}, "status": { "ingress": { "url": "" }, "ready": true, "reason": "" } } ], "kind": "PravegaSearchList", "metadata": { "continue": "", "resourceVersion": "4687681", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/pravegasearches/" } }

Response codes


Get a cluster

Retrieve information about a specific cluster and its status.

Request

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Example request

The following request gets information about the cluster named psearch in the bbb project.

GET {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches/psearch

Example responses

The status information appears at the end of the payload.

Description Response body

Get resource

HTTP Status: 200 { "apiVersion": "search.nautilus.dellemc.com/v1alpha1", "kind": "PravegaSearch", "metadata": { "creationTimestamp": "2020-03-03T05:28:58Z", "generation": 3, "name": "psearch", "namespace": "bbb", "resourceVersion": "4338841", "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/ namespaces/bbb/pravegasearches/psearch", "uid": "7748428b-6c45-486d-8649-94f61f9c4545" }, "spec": { "externalAccess": { "enabled": false, "type": "ClusterIP" }, "imageRef": "cluster-psearch-image", "indexWorker": { "jvmOpts": [ "-Xmx2g" ], "replicas": 2, "resources": { "requests": { "memory": "2Gi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } }



}, "kibana": { "replicas": 1 }, "options": { "psearch.leaderElectionType": "fast", "psearch.luceneDirectoryType": "pravega_bytestream", "psearch.metricsEnableStatistics": "false", "psearch.metricsInfluxDBName": "psearch", "psearch.metricsInfluxDBPassword": "password", "psearch.metricsInfluxDBUri": "http://pravega-search- influxdb.default.svc.cluster.local:8086", "psearch.metricsInfluxDBUserName": "admin", "psearch.perfCountEnabled": "true" }, "pSearchController": { "jvmOpts": [ "-Xmx2g" ], "replicas": 2, "resources": { "requests": { "memory": "2Gi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "pravegaControllerAddress": "tcp://nautilus-pravega- controller.nautilus-pravega.svc.cluster.local:9090", "queryWorker": { "replicas": 2, "resources": {}, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "resthead": { "replicas": 2, "resources": {}, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": {



"requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } }, "securityEnabled": true, "shardWorker": { "jvmOpts": [ "-Xmx3g", "-XX:MaxDirectMemorySize=3g" ], "replicas": 2, "resources": { "requests": { "memory": "3330Mi" } }, "storage": { "VolumeClaimTemplate": { "accessModes": [ "ReadWriteOnce" ], "dataSource": null, "resources": { "requests": { "storage": "10Gi" } }, "storageClassName": "standard" } } } }, "status": { "ingress": { "url": "" }, "ready": true, "reason": "" } }

Not Found

HTTP Status: 404 { "kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Failure", "message": "pravegasearches.search.nautilus.dellemc.com \"pseawrch\" not found", "reason": "NotFound", "details": { "name": "pseawrch", "group": "search.nautilus.dellemc.com", "kind": "pravegasearches" }, "code": 404 }


Edit a cluster using PATCH

Change a PSearch cluster configuration by providing only the changed information in the request body.

Request

PATCH {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Header

Header: content-type:application/merge-patch+json

Body

Supply only the object that you are changing.

Example request

The following example request changes the number of indexworkers.

PATCH {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Header

Header: content-type:application/merge-patch+json

Body

{ "spec": { "indexworker": { "replicas" : 2 } }}

Example responses

Standard HTTP status codes and response bodies are returned.

Edit a cluster using PUT

Change a PSearch cluster configuration by providing the entire spec object in the request body.

Request

PUT {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

Body

Use the response payload from GET, editing fields as needed.


Example request

The following example request changes the configuration of the cluster.

PUT {kubenetes_rest_root}/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/{namespace}/pravegasearches/{psearch-name}

{
  "apiVersion": "search.nautilus.dellemc.com/v1alpha1",
  "kind": "PravegaSearch",
  "metadata": {
    "creationTimestamp": "2020-03-03T05:28:58Z",
    "generation": 3,
    "name": "psearch",
    "namespace": "bbb",
    "resourceVersion": "4338841",
    "selfLink": "/apis/search.nautilus.dellemc.com/v1alpha1/namespaces/bbb/pravegasearches/psearch",
    "uid": "7748428b-6c45-486d-8649-94f61f9c4545"
  },
  "spec": { ...}
}

Example responses

A Created response, or another standard HTTP status code and response body, is returned.


Troubleshooting Application Deployments

You might encounter errors during deployment and execution of Apache Flink and Spark applications. Logging is an essential tool for debugging and understanding the behavior of your applications.

Users in the platform administrator role can access all system logs using native Kubernetes commands.

Topics:

Apache Flink application failures
Application logging
Application issues related to TLS
Apache Spark Troubleshooting

Apache Flink application failures

If a deployment fails, an Apache Flink application reaches an error state.

Application failed to resolve Maven coordinates

If the mavenCoordinate is incorrect and cannot be found in the local analytics project Maven repository, the deployment fails with errors. To resolve the errors, review the full or partial log files.

Application failed to deploy

A Flink application might fail to deploy if one of the following conditions exists:

The mainClass cannot be found in the JAR file.
The Flink driver application fails with an application exception error.

To identify the cause of a failure, run the kubectl describe command to retrieve a partial log. This command displays the last 50 lines of the log, which is usually enough information to identify the cause of the failure.

$> kubectl describe FlinkApplication -n myproject helloworld

...

... Status: Deployment: Flink Cluster: Image: devops-repo.isus.emc.com:8116/nautilus/flink:0.0.1-017.d7d7da1-1.6.2 Name: mycluster URL: mycluster-jobmanager.myproject:8081 Job Id: Job State: Savepoint: file:/mnt/flink/savepoints/savepoint-3d2dfb-b37f0c8ad3a5 Deployment Number: 1 Error: Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 5 more

2019-02-04 16:02:16,759 INFO org.apache.flink.core.fs.FileSystem


- Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.

2019-02-04 16:02:17,024 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.

2019-02-04 16:02:17,047 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.

2019-02-04 16:02:17,106 INFO org.apache.flink.client.cli.CliFrontend - Running 'run' command.

2019-02-04 16:02:17,111 INFO org.apache.flink.client.cli.CliFrontend - Building program from JAR file

2019-02-04 16:02:17,158 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command. org.apache.flink.client.program.ProgramInvocationException:

The program's entry point class 'com.dellemc.flink.error.StreamingJob' was not found in the jar file.

org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:617)

at org.apache.flink.client.program.PackagedProgram. (PackagedProgram.java:199) at org.apache.flink.client.cli.CliFrontend.buildProgram(CliFrontend.java:865) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:207) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053) at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1129) at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java :30) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1129)

Caused by: java.lang.ClassNotFoundException: com.dellemc.flink.error.StreamingJob

at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:614)

... 7 more

------------------------------------------------------------ The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program's entry point class 'com.dellemc.flink.error.StreamingJob' was not found in the jar file.

at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:617) at org.apache.flink.client.program.PackagedProgram. (PackagedProgram.java:199) at org.apache.flink.client.cli.CliFrontend.buildProgram(CliFrontend.java:865) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:207) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053) at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1129) at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java :30) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1129)

Caused by: java.lang.ClassNotFoundException: com.dellemc.flink.error.StreamingJob

at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method)


at java.lang.Class.forName(Class.java:348) at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:614)

... 7 more

Observable Generation: 4 Reason: Flink Driver Failed State: error

Review the full deployment log

If you need to review the full deployment log, first identify the deployment pod and then use the kubectl logs command to retrieve the log output.

Steps

1. Identify the deployment pod using the kubectl get pods command, which returns the deployment pod name in the format -app- - .

> kubectl get pods -n taxidemo

NAME                                   READY   STATUS      RESTARTS   AGE
iot-gateway                            1/1     Running     0          27m
repo-5689d8c9b-ccs2p                   1/1     Running     0          34m
taxidemo-jobmanager-0                  1/1     Running     0          16m
taxidemo-rawdata-app-v1-1-mllnb        0/1     Completed   0          16m   <<<<<<<<<<<
taxidemo-rawdata-app-v2-1-gdfjl        0/1     Completed   0          2m    <<<<<<<<<<<
taxidemo-stats-app-v1-1-ldtks          0/1     Completed   0          16m   <<<<<<<<<<<
taxidemo-taskmanager-b596fcc55-j6jtd   1/1     Running     2          16m
zookeeper-0                            1/1     Running     0          34m
zookeeper-1                            1/1     Running     0          34m
zookeeper-2                            1/1     Running     0          33m

2. Use the kubectl logs command to obtain the full log output. For example:

> kubectl logs taxidemo-rawdata-app-v2-1-gdfjl -n taxidemo
Executing Flink Run
-------------------
2019-02-07 16:20:25,285 INFO org.apache.flink.client.cli.CliFrontend - --------------------------------------------------------------------------------
2019-02-07 16:20:25,286 INFO org.apache.flink.client.cli.CliFrontend - Starting Command Line Client (Version: 1.6.2, Rev:3456ad0, Date:17.10.2018 @ 18:46:46 GMT)

2019-02-07 16:20:25,286 INFO org.apache.flink.client.cli.CliFrontend - OS current user: root

2019-02-07 16:20:25,286 INFO org.apache.flink.client.cli.CliFrontend - Current Hadoop/Kerberos user:

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - Maximum heap size: 878 MiBytes

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - JAVA_HOME: /docker-java-home/jre

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - No Hadoop Dependency available

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - JVM Options:


2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -Dlog.file=/opt/flink/log/flink--client-taxidemo-rawdata-app-v2-1-gdfjl.log

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -Dlog4j.configuration=file:/etc/flink/log4j-cli.properties

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -Dlogback.configurationFile=file:/etc/flink/logback.xml

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - Program Arguments:

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - run
2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -d

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -s

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - file:/mnt/flink/savepoints/savepoint-18b407-f5f5a79d567f

2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - -c
2019-02-07 16:20:25,287 INFO org.apache.flink.client.cli.CliFrontend - com.dellemc.BadDriver ...

2019-02-07 16:20:25,320 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2019-02-07 16:20:25,470 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.

2019-02-07 16:20:25,513 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.

2019-02-07 16:20:25,568 INFO org.apache.flink.client.cli.CliFrontend - Running 'run' command.

2019-02-07 16:20:25,572 INFO org.apache.flink.client.cli.CliFrontend - Building program from JAR file

2019-02-07 16:20:25,670 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The program's entry point class 'com.dellemc.BadDriver' was not found in the jar file.
	at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:617)
	at org.apache.flink.client.program.PackagedProgram.<init>(PackagedProgram.java:199)
	at org.apache.flink.client.cli.CliFrontend.buildProgram(CliFrontend.java:865)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:207)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1129)
	at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1129)
Caused by: java.lang.ClassNotFoundException: com.dellemc.BadDriver
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:614)
	... 7 more

------------------------------------------------------------
The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program's entry point class 'com.dellemc.BadDriver' was not found in the jar file.
	at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:617)
	at org.apache.flink.client.program.PackagedProgram.<init>(PackagedProgram.java:199)
	at org.apache.flink.client.cli.CliFrontend.buildProgram(CliFrontend.java:865)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:207)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1129)
	at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1129)
Caused by: java.lang.ClassNotFoundException: com.dellemc.BadDriver
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:614)
	... 7 more

...

...
Error Running Flink Driver Application

Application logging

Flink applications consist of tasks that run on TaskManagers. Any logging that an operator performs is contained in the TaskManager logs of the target cluster.

Flink operator tasks can run on any of the TaskManagers within the Flink cluster. You must first identify the TaskManagers of the target cluster before you can review logs.

Review application logs

Identify the TaskManagers of the target cluster and view logs.

Steps

1. First identify the TaskManagers of the target cluster. The TaskManager pod names contain -taskmanager-, for example taxidemo-taskmanager-b596fcc55-j6jtd in the output below.

> kubectl get pods -n taxidemo
NAME                                   READY   STATUS      RESTARTS   AGE
iot-gateway                            1/1     Running     0          42m
repo-5689d8c9b-ccs2p                   1/1     Running     0          50m
taxidemo-jobmanager-0                  1/1     Running     0          32m
taxidemo-rawdata-app-v1-1-mllnb        0/1     Completed   0          32m
taxidemo-rawdata-app-v2-1-gdfjl        0/1     Completed   0          17m
taxidemo-stats-app-v1-1-ldtks          0/1     Completed   0          32m
taxidemo-taskmanager-b596fcc55-j6jtd   1/1     Running     2          32m   <<<<<<<<<
zookeeper-0                            1/1     Running     0          50m
zookeeper-1                            1/1     Running     0          49m
zookeeper-2                            1/1     Running     0          49m

2. Run the kubectl logs command to view the logs. For example:

> kubectl logs taxidemo-taskmanager-b596fcc55-j6jtd -n taxidemo
Starting Task Manager
config file:
jobmanager.rpc.address: taxidemo-taskmanager-b596fcc55-j6jtd
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 1
rest.port: 8081
blob.server.port: 6124
query.server.port: 6125
Starting taskexecutor as a console application on host taxidemo-taskmanager-b596fcc55-j6jtd.

2019-02-07 16:06:06,165 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2019-02-07 16:06:06,166 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager (Version: 1.6.2, Rev:3456ad0, Date:17.10.2018 @ 18:46:46 GMT)

2019-02-07 16:06:06,166 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - OS current user: flink

2019-02-07 16:06:06,166 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Current Hadoop/Kerberos user:

2019-02-07 16:06:06,166 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13

2019-02-07 16:06:06,167 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum heap size: 922 MiBytes

2019-02-07 16:06:06,167 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JAVA_HOME: /docker-java-home/jre

2019-02-07 16:06:06,167 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - No Hadoop Dependency available

2019-02-07 16:06:06,167 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM Options:

2019-02-07 16:06:06,167 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:+UseG1GC

2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xms922M

2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xmx922M

2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:MaxDirectMemorySize=8388607T

2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog4j.configuration=file:/etc/flink/log4j-console.properties

2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlogback.configurationFile=file:/etc/flink/logback-console.xml


2019-02-07 16:06:06,168 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Program Arguments:

3. You can also follow new log entries using the kubectl logs command with the -f parameter. For example:

> kubectl logs taxidemo-taskmanager-b596fcc55-j6jtd -n taxidemo -f
...
...
...
2019-02-07 16:41:38,678 INFO io.pravega.client.netty.impl.RawClient - Closing connection with exception: null

2019-02-07 16:41:39,695 INFO io.pravega.client.netty.impl.ClientConnectionInboundHandler - Connection established ChannelHandlerContext(ClientConnectionInboundHandler#0, [id: 0x742c134f])

2019-02-07 16:41:39,698 INFO io.pravega.client.netty.impl.RawClient - Closing connection with exception: null

2019-02-07 16:41:44,425 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Attempting to refresh the controller server endpoints

2019-02-07 16:41:44,425 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Attempting to refresh the controller server endpoints

2019-02-07 16:41:44,427 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Updating client with controllers: [[addrs=[nautilus-pravega-controller.nautilus-pravega.svc.cluster.local/ 10.100.200.14:9090], attrs={}]]

2019-02-07 16:41:44,427 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Updating client with controllers: [[addrs=[nautilus-pravega-controller.nautilus-pravega.svc.cluster.local/ 10.100.200.14:9090], attrs={}]]

2019-02-07 16:41:44,427 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Rescheduling ControllerNameResolver task for after 120000 ms

2019-02-07 16:41:44,427 INFO io.pravega.client.stream.impl.ControllerResolverFactory - Rescheduling ControllerNameResolver task for after 120000 ms

2019-02-07 16:41:49,757 INFO io.pravega.client.netty.impl.ClientConnectionInboundHandler - Connection established ChannelHandlerContext(ClientConnectionInboundHandler#0, [id: 0xf2faca14])

2019-02-07 16:41:49,759 INFO io.pravega.client.netty.impl.RawClient

Application issues related to TLS

The SDP supports native Transport Layer Security (TLS) for external connections to the platform. This section describes TLS-related connection information in Pravega and Flink applications.

The TLS feature is optional and is enabled or disabled with a true/false setting in the configuration values file during installation. For more information, see the TLS configuration details section in the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/content/docu103271

When TLS is enabled in the SDP, Pravega applications must use a TLS endpoint to access the Pravega datastore.

The URI used in the Pravega client application, in the ClientConfig class, must start with:

tls://<pravega-endpoint>:443

NOTE: If the URI starts with tcp://, the application fails with a javax.net.ssl.SSLHandshakeException error.

Troubleshooting Application Deployments 139

To obtain the Pravega ingress endpoint, run the following command:

kubectl get ing pravega-controller -n nautilus-pravega

The HOST column in the output shows the Pravega endpoint.
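For reference, the following is a minimal sketch (not taken from the product samples) of how a Pravega client application might build its ClientConfig with a TLS controller URI. The host name is a placeholder; substitute the HOST value returned by the kubectl get ing command above.

import io.pravega.client.ClientConfig;

import java.net.URI;

public class TlsClientConfigSketch {
    public static void main(String[] args) {
        // Placeholder endpoint: replace with the HOST value shown by
        // "kubectl get ing pravega-controller -n nautilus-pravega".
        URI controllerURI = URI.create("tls://pravega.sdp.example.com:443");

        // The URI scheme must be tls:// when TLS is enabled in SDP;
        // a tcp:// URI fails with javax.net.ssl.SSLHandshakeException.
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(controllerURI)
                .build();

        System.out.println("Controller URI: " + clientConfig.getControllerURI());
    }
}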

Apache Spark Troubleshooting

Authentication failed is a common error message.

Symptom:

io.pravega.shaded.io.grpc.StatusRuntimeException: UNAUTHENTICATED: Authentication failed

Causes:

- The Keycloak client JAR is not included in the application classpath.
- The application attempts to create a scope. Make sure that the option allow_create_scope is set to false.
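For reference only, here is a minimal sketch of a Spark structured-streaming read that avoids both causes. It assumes the pravega source format and the controller, scope, stream, and allow_create_scope option names documented for the Pravega Spark Connector; the endpoint, scope, and stream values are hypothetical. The Pravega Keycloak client JAR must also be packaged with the application so that authentication can succeed.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PravegaSparkReadSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("pravega-read-sketch")
                .getOrCreate();

        // Hypothetical scope and stream names; the scope must already exist
        // because allow_create_scope is disabled on SDP.
        Dataset<Row> events = spark.readStream()
                .format("pravega")
                .option("controller", "tls://pravega.sdp.example.com:443")
                .option("allow_create_scope", "false")
                .option("scope", "examples")
                .option("stream", "sensor-data")
                .load();

        // Print the streamed events to the console for a quick check.
        events.writeStream()
                .format("console")
                .start()
                .awaitTermination();
    }
}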


SDP Code Hub Applications and Sample Code

You can develop streaming data applications using sample code templates available on the SDP Code Hub. Dell EMC monitors this hub, which is a community-supported public GitHub portal for developers and integrators. It includes Pravega connectors, demos, sample applications, and API templates, along with Pravega, Spark, and Flink examples from the open-source SDP developer community.

The SDP Code Hub includes three main categories:

- Code samples -- Example projects that demonstrate a single concept or feature, for example, exactly-once processing using the Pravega client.
- Connectors -- Libraries or components that can be used to integrate Pravega with other technologies.
- Demos -- Small but complete applications that tell a story and demonstrate a use case, for example, iDRAC telemetry and video facial recognition.

Visit the SDP Code Hub: https://streamingdataplatform.github.io/code-hub/ to browse articles about open-source SDP projects and download examples. The SDP Platform Developer repository includes additional workshop code samples: https://github.com/StreamingDataPlatform/workshop-samples

The SDP Code Hub portal does not contain any product downloads, advertising, commercial, or proprietary software.

Topics:

Develop an example video processing application

Develop an example video processing application

The SDP Code Hub includes a video processing code example that demonstrates how to perform data ingestion, analytics, and visualization, with methods to store, process, and read video using Pravega and Apache Flink. The SDP Code Hub also includes an application that performs object detection using TensorFlow and YOLO.

The video processing example project on the SDP Code Hub shows how to:

- Start with local-only development to avoid complications and iterate quickly. Flink and Pravega run on your local machine for this example.
- Install a local stand-alone copy of Pravega in Docker.
- Create a Pravega scope using the Pravega RESTful APIs (see the sketch after this list).

It requires you to:

- Set up application parameters using the IntelliJ IDE from JetBrains (available for free download).
- Connect a USB camera to a VMware virtual machine.
- Run the example application.
- Understand the example Java source code.
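The scope-creation step can be done with a plain HTTP call. The following sketch assumes the Pravega Controller REST API's /v1/scopes resource and the default stand-alone REST port 9091; the scope name is hypothetical, so adjust both values to match your local setup.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CreateScopeSketch {
    public static void main(String[] args) throws Exception {
        // Assumed REST endpoint of a local stand-alone Pravega controller.
        URL url = new URL("http://localhost:9091/v1/scopes");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Hypothetical scope name used by the rest of the example.
        byte[] body = "{\"scopeName\": \"examples\"}".getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }

        // 201 indicates that the scope was created; 409 indicates that it already exists.
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
    }
}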

To get the most out of this project, you need a basic understanding of the high-level architecture of the SDP including: Pravega for processing and storage, Apache Flink or Spark for analysis, Kubernetes, authentication and metrics, and back-end storage.

Download the installation files from https://github.com/pravega/video-samples. Use the examples, along with the following required components, available for download from the sites listed below:

- Java 8 (JDK 1.8) -- The Java programming language and Java development life cycle, including writing, compiling, and creating JAR files and running the JDK or JRE.

- Kubernetes -- Kubernetes (K8s) for automating deployment, scaling, and management of containerized applications. A Kubernetes tutorial contains scripts and code to deploy all components on Kubernetes. See https://kubernetes.io/.

- Docker -- Install Docker and Docker Compose to simplify the deployment of the components: https://docs.docker.com/v17.12/install/


- Pravega -- For additional SDP developer code examples, see the SDP Code Hub: https://streamingdataplatform.github.io/code-hub/. For information about Pravega, see http://pravega.io.

- Apache Flink -- See https://flink.apache.org for more information.

- Streaming Data Generator -- A streaming data generator has been created for this example that can generate synthetic sensor and video data and send it to the Pravega Gateway. These Python scripts are in the streaming_data_generator directory.

- Pravega Gateway -- This GRPC server provides a gateway to a Pravega server. It provides limited Pravega functionality to any environment that supports GRPC, including Python. For more information, see Pravega Gateway.

- Maven 3.6
- Gradle 4.0
- A webcam installed on your local workstation.

Overview of a video processing application

The SDP is not required for local application development and testing. You should start development on your local machine because it is easier to iterate quickly and troubleshoot applications when Flink or Spark applications and Pravega are installed locally. When you install and develop locally, your log files are also local.

Once you have an application developed, you can debug it locally using Docker and then deploy and test it on the SDP.

It usually takes a few minutes to deploy an application to the Platform, including uploading a JAR file, starting your clusters, and running your applications.

In this example, a locally installed web camera sends images to an application called Camera Recorder. The Camera Recorder application reads frames from the camera and writes the data to a Pravega stream called the Video Stream.

Figure 25. Sample application for video processing with Pravega

The Video Data Generator Job generates random images that go into the same video stream. These random images do not have metadata associated with them, so you can distinguish those images from the ones coming from the camera recorder, which are tagged with metadata. There is a timestamp associated with each image in the video.

A Flink job called Multi-Video Grid Job processes the images. In a real-world situation, images from different servers would have different timestamps. As a result, this application aligns images in windows of time (10 milliseconds) to create a single image grid output of four images. This scenario could be used if you wanted to record four camera outputs at a time from independent video streams. It could also be used for real-time machine learning based on four different camera inputs. It aligns the images by time and then displays multi-video streaming output.

The Multi-Video Grid Job performs several steps:

- Resizes images (scales multiple images down in parallel).
- Aligns images by their time window. Because the images have timestamps that are slightly off, the job aligns them to be exactly within 10 milliseconds of each other (see the sketch after this list).
- Joins the images as a single image in a grid and writes them to a new stream called the Grid Stream.
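The complete job is part of the video-samples project. As an illustration of the alignment step only, the following sketch groups a hypothetical Frame type per camera into 10-millisecond event-time tumbling windows with the Flink DataStream API; the reduce step is a placeholder for the resize-and-join logic in the real job.

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowAlignmentSketch {

    // Hypothetical frame record: a camera id and a capture timestamp in milliseconds.
    public static class Frame {
        public int camera;
        public long timestampMs;

        public Frame() {}

        public Frame(int camera, long timestampMs) {
            this.camera = camera;
            this.timestampMs = timestampMs;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Stand-in source; the real job reads video frames from a Pravega stream.
        DataStream<Frame> frames = env
                .fromElements(new Frame(0, 1000L), new Frame(1, 1003L), new Frame(2, 1009L))
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<Frame>(Time.milliseconds(10)) {
                            @Override
                            public long extractTimestamp(Frame frame) {
                                return frame.timestampMs;
                            }
                        });

        frames
                .keyBy(frame -> frame.camera)                                 // one group per camera
                .window(TumblingEventTimeWindows.of(Time.milliseconds(10)))   // align frames into 10 ms windows
                .reduce((a, b) -> a)                                          // placeholder: the real job resizes and joins images here
                .print();

        env.execute("window-alignment-sketch");
    }
}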

The Video Player application then reads the streams from the Grid Stream and displays them to your screen.

For more information about building and running the video samples, browse and download the samples from: https://github.com/pravega/video-samples


References

This section includes Keycloak and Kubernetes user role definitions and additional support resources.

Topics:

Keycloak and Kubernetes user roles
Additional resources
Where to go for support
Provide feedback about this document

Keycloak and Kubernetes user roles

This section describes the roles used within the Streaming Data Platform.

Streaming Data Platform user roles

Role name in Keycloak   Description

admin                   Users with complete wild-card access to all resources across all services. These users have wildcard access to all projects and all Pravega scopes. One user with this role is created by the installation. You may create additional usernames with the admin role.

<project>-member        Users or clients in Keycloak with access to specifically protected resources for the named project. Access includes but is not limited to: Maven artifacts, Pravega scopes in the project's namespace, and Flink clusters defined in the project. These role names are created whenever a new project is created.

Kubernetes user roles

Role name in Kubernetes   Description

cluster-admin             Users with complete access to all K8s resources within the cluster. If you follow recommended procedures for creating users and assigning roles, the admin users in the previous section become cluster-admin users in the cluster, through impersonation.

<project>-member          Users with curated access to most resources in the project namespace. Project members are aggregated to the common edit role.

NOTE: To use kubectl commands, admin and non-admin users must be UAA users.

For more information, see the Dell EMC Streaming Data Platform Installation and Administration Guide at https://dl.dell.com/ content/docu103271


Additional resources

This section provides additional resources for application developers using the SDP.

Topic                                         Reference

SDP Code Hub                                  SDP Code Hub: https://streamingdataplatform.github.io/code-hub/

SDP Documentation InfoHub                     Documentation resources from Dell EMC:
                                              https://www.dell.com/support/article/us/en/19/sln319974/dell-emc-streaming-data-platform-infohub

Apache Flink concepts, tutorials,             Apache Flink Open Source Project documentation: https://flink.apache.org/
guidelines, and Flink API documentation       Ververica Flink documentation: https://www.ververica.com/

Apache Spark concepts, tutorials,             Apache Spark Open Source Project documentation: https://spark.apache.org
guidelines, and Spark API documentation       Pravega Spark Connector: https://github.com/pravega/spark-connectors
                                              Spark 2.4.5 Programming Guide: https://spark.apache.org/docs/latest/rdd-programming-guide.html
                                              Spark 2.4.5 Streaming Documentation: https://spark.apache.org/docs/latest/streaming-programming-guide.html
                                              Spark Tuning: https://spark.apache.org/docs/latest/tuning.html

Pravega concepts, architecture, use cases,    Pravega Open Source Project documentation: http://www.pravega.io
and the Pravega Controller API                For Pravega concepts, see http://pravega.io/docs/latest/pravega-concepts/
documentation

Flink connector code examples for Pravega     Blog post:
                                              https://pravega.github.io/workshop-samples/processing%20data/2020/03/08/Flink-Connector-Examples-For-Pravega.html

Where to go for support

Dell Technologies Secure Remote Services (SRS) and call home features are available for the SDP. These features require an SRS Gateway server configured on-site to monitor the platform. The SDP installation process configures the connection to the SRS Gateway. An SRS Gateway v3.38 or greater is required. Detected problems are forwarded to Dell Technologies as actionable alerts, and support teams can remotely connect to the platform to help with troubleshooting.

Dell Technologies support

Support tab on the Dell homepage: https://www.dell.com/support/incidents-online. Once you identify your product, the "How to Contact Us" section gives you the option of email, chat, or telephone support.

SDP support landing page

Product information, drivers and downloads, and knowledge base articles: SDP product support landing page.

Telephone support

United States: 1-800-SVC-4EMC (1-800-782-4362)
Canada: 1-800-543-4782
Worldwide: 1-508-497-7901

Local phone numbers for a specific country/region are available at Dell EMC Customer Support Centers.


Provide feedback about this document

Your suggestions help to improve the accuracy, organization, and overall quality of the document.
