Data Integration for Neo4j using Kettle
Matt Casters, matt.casters@neo4j.com
mattcasters Neo4j Chief Solutions Architect
Topics
➢ What is Kettle?
➢ Kettle plugins for Neo4j
➢ Kettle using Neo4j
➢ Examples
➢ The Hunger Games
➢ A visual programming tool for data orchestration
➢ A.k.a. Pentaho Data Integration, from Hitachi Vantara
➢ Over 15 years old
➢ Open source under the Apache License 2.0
➢ Large community, marketplace, ...
➢ Easy to embed, install, package, rebrand
➢ Download your Neo4j remix at www.kettle.be
➢ On tiny and enormous systems, real or virtual
➢ Very small computers, Raspberry Pi sized
➢ Your laptop or browser
➢ Locally or in the cloud
➢ On Hadoop clusters, VMs, Docker, Serverless, ...
➢ At large and small companies
➢ In government
➢ In education
➢ In the Neo4j Solutions Reference Architecture
➢ Reduce costs, reach goals faster
➢ Answers the “build or buy?” question
[Chart: accumulated cost over time, comparing “build” and “buy” with Kettle]
➢ Metadata driven, engine based:
  ○ No code generation
  ○ Define what you need to happen → GUI, Web, code, rules, …
  ○ Clear and transparent, self-documenting
➢ Types of work:
  ○ Jobs for workflows
  ○ Transformations for parallel data streaming
➢ 100% exposure of our engine through UI elements
➢ Everyone should be able to play along: plugins!
➢ We built integration points for others: run everywhere!
➢ Allow the user to avoid programming anything
➢ Allow the user to program anything: JavaScript, Java, Groovy, RegEx, Rules, Python, Ruby, R, …
➢ Transparency wins: best-in-class logging, data lineage, execution lineage, debugging / breakpoints, data previewing, row sniff testing, …
➢ SpoonGit: UI integration with git
➢ WebSpoon: web interface to the full Spoon UI
➢ Data Sets: build transformation unit tests
➢ Native file system protocols: hdfs://, s3://, gs://, …
➢ Hadoop support through a compatibility layer
➢ Kettle Beam: execute transformations on Apache Spark, Apache Flink and GCP DataFlow
➢ Spoon: GUI
➢ Scripts
➢ Server(s)
➢ Java API & SDK
➢ Standard file format
➢ Plugin ecosystem
➢ Docker image(s)
➢ Documentation, books, ...
Version control system (git)
  → Checkout a version
Deploy system (VM, Docker, ...)
  → Configure: set up, initialize, run
Artifacts, graphs, configurations
➢ For reading and writing
➢ Dynamic Cypher
➢ Batching and UNWIND
➢ Parallel execution
➢ High performance
➢ Call procedures
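The batching + UNWIND pattern above can be sketched in Python. This is an illustrative sketch, not the plugin's actual code: rows are grouped into batches and each batch is sent as one UNWIND statement with a parameter list, so a single round trip covers many rows. The `Person` label and field names are made up for the example.

```python
# Sketch of the batching + UNWIND pattern: instead of one round trip
# per row, rows are grouped into batches and sent as a single
# parameter list. Names are illustrative, not the plugin's API.

def unwind_batches(rows, batch_size=1000):
    """Yield (cypher, parameters) pairs, one per batch of rows."""
    cypher = (
        "UNWIND $rows AS row "
        "MERGE (p:Person {id: row.id}) "
        "SET p.name = row.name"
    )
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        yield cypher, {"rows": batch}

rows = [{"id": i, "name": f"person-{i}"} for i in range(2500)]
batches = list(unwind_batches(rows, batch_size=1000))
# 2500 rows at batch size 1000 -> 3 statements instead of 2500
```

Each (cypher, parameters) pair would then be executed in one transaction by a real driver; parallel execution simply shards the incoming rows over several such writers.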
➢ Easy node creation
➢ Create/Merge of ()-[]-()
➢ Batching and UNWIND
➢ Parallel execution
➢ Dynamic labels
➢ Update parts of a graph
➢ Auto-generate Cypher
➢ Using a logical model
➢ Using field mapping
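Generating Cypher from a field mapping might look roughly like this Python sketch (a hypothetical helper, not the step's real generator): the key field lands in the MERGE pattern and the remaining mapped fields become SET clauses.

```python
# Illustrative sketch of auto-generating a MERGE statement from a
# field->property mapping and a key field. Label and field names
# are example data, not plugin output.

def generate_merge(label, key_field, mapping):
    """Build a MERGE statement from a field->property mapping."""
    set_clause = ", ".join(
        f"n.{prop} = row.{field}"
        for field, prop in mapping.items()
        if field != key_field
    )
    return (
        f"UNWIND $rows AS row "
        f"MERGE (n:{label} {{{mapping[key_field]}: row.{key_field}}}) "
        f"SET {set_clause}"
    )

cypher = generate_merge(
    "Customer",
    key_field="customer_id",
    mapping={"customer_id": "id", "customer_name": "name"},
)
```

MERGE on the key property is what makes the operation idempotent: re-running the load updates existing nodes instead of duplicating them.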
➢ Job Entry
➢ Validates that databases are up
➢ Used for error diagnostics
➢ Defensive setup
➢ Job Entry
➢ Executes a series of Cypher statements
➢ Generate CSV files for Neo4j Import
➢ Generates appropriate header
➢ Handles escaping, quoting, …
➢ Outputs file names
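The header and quoting rules can be illustrated with a small Python sketch. The typed header (`id:ID`, `name:string`, `:LABEL`) follows the documented neo4j-import CSV format; the quoting (double quotes, embedded quotes doubled) is what Python's csv module produces and what the import tool accepts. Column names are example data, not the step's actual output.

```python
import csv
import io

# Sketch of what the step produces: a neo4j-import style header
# (field:type, :ID, :LABEL columns) plus properly quoted rows.

def write_import_csv(nodes):
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerow(["id:ID", "name:string", ":LABEL"])  # typed header
    for node in nodes:
        writer.writerow([node["id"], node["name"], node["label"]])
    return buf.getvalue()

out = write_import_csv(
    [{"id": 1, "name": 'Kat "Everdeen"', "label": "Person"}]
)
```

Embedded double quotes are doubled (`""Everdeen""`), which is the escaping the bulk importer expects.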
➢ Splits a graph field into nodes and relationships
➢ Used for unique value calculation
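Conceptually the split works like the following Python sketch (the data shapes are illustrative, not the step's internal representation): walk the relationships, collect each endpoint node once, and keep the relationships as-is, so unique values can be counted.

```python
# Illustrative sketch of splitting a "graph" value into its nodes and
# relationships, de-duplicating nodes for unique value calculation.

def split_graph(graph):
    nodes, seen = [], set()
    for rel in graph["relationships"]:
        for node in (rel["from"], rel["to"]):
            key = (node["label"], node["id"])
            if key not in seen:  # keep each node only once
                seen.add(key)
                nodes.append(node)
    return nodes, graph["relationships"]

g = {
    "relationships": [
        {"from": {"label": "Person", "id": 1},
         "to": {"label": "Person", "id": 2}, "type": "KNOWS"},
        {"from": {"label": "Person", "id": 2},
         "to": {"label": "Person", "id": 1}, "type": "KNOWS"},
    ]
}
nodes, rels = split_graph(g)  # 2 unique nodes, 2 relationships
```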
➢ Runs a neo4j-import command
➢ Accepts the filenames of CSV files
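Assembling the command might look like this sketch. The `--database`, `--nodes` and `--relationships` flags exist on `neo4j-admin import`, but paths, file names and the exact flag set vary by Neo4j version, so treat this as illustrative rather than the entry's real implementation.

```python
# Hedged sketch: building a neo4j-admin import command line from the
# CSV file names produced by the previous steps. File names below are
# examples, not real output.

def build_import_command(database, node_files, rel_files):
    cmd = ["neo4j-admin", "import", f"--database={database}"]
    for f in node_files:
        cmd.append(f"--nodes={f}")
    for f in rel_files:
        cmd.append(f"--relationships={f}")
    return cmd

cmd = build_import_command(
    "graph.db",
    node_files=["nodes-person.csv"],
    rel_files=["rels-knows.csv"],
)
```

A real job entry would hand this list to a process runner (e.g. `subprocess.run(cmd)`) and fail the job on a non-zero exit code.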
➢ Writes logging to Neo4j
➢ Builds an execution lineage graph
➢ Updates a metadata graph
➢ Execution details are stored at the Job, Job entry, Transformation, Step and Database levels
➢ Stores graph updates
  ○ Node creation or update
  ○ Relationship creation or update
➢ Examine past executions
  ○ See what went wrong over the weekend
  ○ Click on a step to see how long it took
  ○ Examine log texts
  ○ Generate Cypher queries to examine further
➢ Calculate a delta window
  ○ Takes the last error-free execution into account
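The delta-window calculation can be sketched as: scan the logged executions and start the new window at the most recent run that finished without errors. The data shape below is illustrative, not the actual logging schema.

```python
from datetime import datetime

# Sketch of the delta-window idea: look back through past executions
# (as logged in Neo4j) and anchor the new window on the last run
# that finished without errors.

def delta_window_start(executions):
    """Return the start of the last error-free execution, or None."""
    clean = [e for e in executions if e["errors"] == 0]
    if not clean:
        return None
    return max(clean, key=lambda e: e["start"])["start"]

runs = [
    {"start": datetime(2019, 11, 1), "errors": 0},
    {"start": datetime(2019, 11, 2), "errors": 3},  # failed run
    {"start": datetime(2019, 11, 3), "errors": 0},
]
window_start = delta_window_start(runs)
```

Anchoring on the last clean run rather than the last run means a failed weekend load is automatically re-covered by the next incremental extract.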
➢ Top-to-bottom: find an error
  ○ Large jobs are hard to debug
  ○ Sub-jobs and sub-transformations obfuscate
  ○ Going through logging takes time
  ○ We know the loaded job or transformation
  ○ Neo4j can find the shortest path to the lowest execution node without children that has errors > 0
  ○ We can show these shortest paths to the error
  ○ The user knows in seconds where the error happened and can go straight to it to fix it
➢ Bottom-up: how was a component executed?
  ○ We know the step or job entry selected
  ○ Neo4j can find the shortest path to the root execution node without parents
  ○ We can show these execution paths
  ○ The user knows how something was executed
  ○ Very useful in highly dynamic conditional executions
➢ Examining executions with Browser or Bloom
➢ What exactly executed what, how, when, …?
➢ We generate Cypher for Neo4j beginners
➢ Fun Neo4j learning path for Kettle users
Coming soon:
➢ Data profiling
➢ Git branches and commit history graph
➢ Transformation unit testing results
➢ Transformation data lineage information
➢ …
➢ Demonstrates the Neo4j Output step
➢ Read a CSV file in parallel
➢ Load the data into nodes in parallel
➢ Demonstrates the Neo4j Graph Output step
➢ Updates multiple nodes and relationships at once
➢ Takes key values into account to ignore unchanged nodes
➢ Automatically generates MERGE statements
➢ Read using a Cypher query
➢ Write to an Excel file
Data Integration for Neo4j using Kettle:
➢ Work faster, tackle harder problems
➢ Reduce risk by showing results faster
➢ Govern your Neo4j solutions using Neo4j
Kettle Community Meetup
➢ kcm19.be
➢ Antwerp
➢ Saturday, November 23rd
kettle-community.slack.com
➢ Mail me for an invite: matt.casters@neo4j.com
Questions for the Hunger Games
1. Easy: Can you extract information from relational databases using Kettle?
a. No
b. Yes, but only a few
c. Yes, almost all of them
2. Medium: Can I script harder parts of my data orchestration work?
a. No, Kettle is a visual programming tool
b. Yes, you can use all popular scripting languages
c. Yes, you can use JavaScript
3. Hard: Can Kettle work with big data resources?
a. Yes, Kettle has native support for protocols like S3, HDFS, GS and others
b. Yes: (a), plus Kettle also supports visual Map/Reduce development
c. Yes: (b), plus Kettle also supports execution on the Spark, Flink and DataFlow engines
Answer here: r.neo4j.com/hunger-games