Data Integration for Neo4j using Kettle, Matt Casters



SLIDE 1

Data Integration for Neo4j using Kettle

Matt Casters, matt.casters@neo4j.com

mattcasters Neo4j Chief Solutions Architect

SLIDE 2

Topics

➢ What is Kettle?
➢ Kettle plugins for Neo4j
➢ Kettle using Neo4j
➢ Examples
➢ The Hunger Games
➢ Q&A

SLIDE 3

What is Kettle?

SLIDE 4

Kettle: Introduction

➢ A visual programming tool for data orchestration
➢ A.k.a. Pentaho Data Integration from Hitachi Vantara
➢ Over 15 years old
➢ Open source under the Apache License 2.0
➢ Large community, marketplace, ...
➢ Easy to embed, install, package, rebrand
➢ Download your Neo4j remix at www.kettle.be

SLIDE 5

Kettle: where is it used?

➢ On tiny and enormous systems, real or virtual
➢ Very small computers, Raspberry Pi sized
➢ Your laptop or browser
➢ Locally or in the cloud
➢ On Hadoop clusters, VMs, Docker, Serverless
➢ At large and small companies
➢ In government
➢ In education
➢ In the Neo4j Solutions Reference Architecture

SLIDE 6

Kettle: Why is it used?

➢ Reduce costs, reach goals faster
➢ Answers the “build or buy?” question

[Chart: accumulated cost over time for "build", "buy" and Kettle]

SLIDE 7

Kettle: Architecture

➢ Metadata driven, engine based:
  ○ No code generation
  ○ Define what you need to happen
    → GUI, Web, code, rules, …
  ○ Clear and transparent, self documenting
➢ Types of work:
  ○ Jobs for workflows
  ○ Transformations for parallel data streaming

SLIDE 8

Kettle: Design

➢ 100% exposure of our engine through UI elements
➢ Everyone should be able to play along: plugins!
➢ We built integration points for others: run everywhere!
➢ Allow the user to avoid programming anything
➢ Allow the user to program anything: JavaScript, Java, Groovy, RegEx, Rules, Python, Ruby, R, …
➢ Transparency wins: best in class logging, data lineage, execution lineage, debugging / breakpoints, data previewing, row sniff testing, …

SLIDE 9

Other Kettle options available to you...

➢ SpoonGit: UI integration with git
➢ WebSpoon: web interface to the full Spoon UI
➢ Data Sets: build transformation unit tests
➢ Native file system protocols: hdfs://, s3://, gs:// …
➢ Hadoop support through a compatibility layer
➢ Kettle Beam: execute transformations on Apache Spark, Apache Flink and GCP DataFlow

SLIDE 10

Kettle: The Toolset

➢ Spoon: GUI
➢ Scripts
➢ Server(s)
➢ Java API & SDK
➢ Standard file format
➢ Plugin ecosystem
➢ Docker image(s)
➢ Documentation, books, ...

SLIDE 11

Architecture

[Diagram: Version control system (git) → checkout version → deploy system (VM, Docker, ...) → configure, set up, initialize, run. Labelled contents: artifacts, graphs, configurations]

SLIDE 12

Kettle plugins for Neo4j

SLIDE 13

Plugins: Neo4j Cypher

➢ For reading and writing
➢ Dynamic Cypher
➢ Batching and UNWIND
➢ Parallel execution
➢ High performance
➢ Call procedures
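
To illustrate the "Batching and UNWIND" idea, here is a minimal Python sketch, not the plugin's actual implementation. It groups incoming rows into batches and pairs each batch with a single parameterized UNWIND statement. The label `Person`, the property names and the batch size are assumptions made up for the example.

```python
# Hypothetical statement: the label and property names are illustrative only.
UNWIND_MERGE = (
    "UNWIND $rows AS row "
    "MERGE (p:Person {id: row.id}) "
    "SET p.name = row.name"
)


def batches(rows, size):
    """Group incoming rows into lists of at most `size` items."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def plan_statements(rows, size=1000):
    """Pair the one UNWIND statement with one parameter map per batch."""
    return [(UNWIND_MERGE, {"rows": batch}) for batch in batches(rows, size)]
```

Each (statement, parameters) pair would then be sent to the database in one round trip, so 1,000 rows cost one network call instead of 1,000.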

SLIDE 14

Plugins: Neo4j Output

➢ Easy node creation
➢ Create/Merge of ()-[]-()
➢ Batching and UNWIND
➢ Parallel execution
➢ Dynamic labels

SLIDE 15

Plugins: Neo4j Graph Output

➢ Update parts of a graph
➢ Auto-generate Cypher
➢ Using a logical model
➢ Using field mapping
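
A rough Python sketch of what "auto-generate Cypher from a field mapping" could look like. This is an assumption about the general technique, not the step's real code. The function name `merge_node_cypher` and the `$row` parameter are hypothetical. The label is interpolated into the statement text because classic Cypher parameters carry values, not labels.

```python
def merge_node_cypher(label, key_fields, other_fields):
    """Build a parameterized MERGE for one node from a field mapping.

    key_fields identify the node; other_fields are set after the merge.
    Values are read from a map parameter named $row (hypothetical name).
    """
    keys = ", ".join(f"{f}: $row.{f}" for f in key_fields)
    cypher = f"MERGE (n:`{label}` {{{keys}}})"
    if other_fields:
        sets = ", ".join(f"n.{f} = $row.{f}" for f in other_fields)
        cypher += f" SET {sets}"
    return cypher


print(merge_node_cypher("Person", ["id"], ["name", "city"]))
```

Merging on key fields only keeps the statement idempotent: running the same row twice updates the node instead of duplicating it.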

SLIDE 16

Plugins: Check Neo4j Connection

➢ Job Entry
➢ Validate DBs are up
➢ Used in error diagnostics
➢ Defensive setup

SLIDE 17

Plugins: Neo4j Cypher Script

➢ Job Entry
➢ Executes a series of Cypher statements

SLIDE 18

Neo4j Generate CSVs

➢ Generate CSV files for Neo4j Import
➢ Generates appropriate header
➢ Handles escaping, quoting, …
➢ Outputs file names
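
As an illustration of the header and quoting concerns, the following Python sketch writes a node file in the header style used by the Neo4j import tool (`:ID` on the identifier column, a `:LABEL` column). The function name, field names and sample data are assumptions; the step itself may structure its files differently.

```python
import csv
import io


def nodes_csv(rows, id_field, properties, label):
    """Render node rows as CSV text with an import-style header line."""
    buf = io.StringIO()
    # QUOTE_MINIMAL quotes only fields containing commas, quotes or newlines;
    # embedded double quotes are escaped by doubling them.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow([f"{id_field}:ID", *properties, ":LABEL"])
    for row in rows:
        writer.writerow([row[id_field], *(row[p] for p in properties), label])
    return buf.getvalue()


people = [{"id": 1, "name": 'Ann "A" Lee'}, {"id": 2, "name": "Bob, Jr."}]
print(nodes_csv(people, "id", ["name"], "Person"))
```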

SLIDE 19

Neo4j Split Graph

➢ Splits a graph field into nodes and relationships
➢ Used for unique value calculation
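
The splitting idea can be sketched in Python as follows. The dictionary layout used for a "graph value" here (`nodes` with `label` and `id`, `relationships` with `source`, `type`, `target`) is a made-up stand-in, not Kettle's actual graph data type.

```python
def split_graphs(graphs):
    """Split graph values into unique nodes and unique relationships.

    Nodes are keyed on (label, id); relationships on (source, type, target).
    The first occurrence of each key wins.
    """
    nodes, rels = {}, {}
    for graph in graphs:
        for node in graph["nodes"]:
            nodes.setdefault((node["label"], node["id"]), node)
        for rel in graph["relationships"]:
            rels.setdefault((rel["source"], rel["type"], rel["target"]), rel)
    return list(nodes.values()), list(rels.values())
```

Deduplicating before generating import files matters because the same node typically appears in many input rows, once per relationship it takes part in.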

SLIDE 20

Neo4j Importer

➢ Runs a neo4j-import command
➢ Accepts the filenames of CSV files

SLIDE 21

Kettle using Neo4j

SLIDE 22

Using Neo4j in Kettle : Logging

➢ Write logging to Neo4j
➢ Builds an execution lineage graph
➢ Updates a metadata graph
➢ Execution details are stored on Job, Job entry, Transformation, Steps, Database levels
➢ Stores graph updates
  ○ Node creation or update
  ○ Relationship creation or update

SLIDE 23

Using Neo4j in Kettle : Logging

➢ Documents the execution process
  ○ Log text, times, lineage

SLIDE 24

Using Neo4j in Kettle : Logging

➢ Examine past executions
  ○ See what went wrong over the weekend
  ○ Click on a step to see how long it took
  ○ Examine log texts
  ○ Generate Cypher queries to examine further
➢ Calculate delta window
  ○ Take last execution without error into account
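
One way to read the delta window bullet, sketched in Python under stated assumptions: the window for the next run starts where the most recent error-free execution ended, so failed runs over the weekend are reprocessed automatically. The record layout (`end`, `errors`) is hypothetical.

```python
from datetime import datetime


def delta_window(executions, now):
    """Return (start, end) for the next incremental load.

    start is the end time of the latest execution with zero errors,
    or None when no clean execution exists (meaning: load everything).
    """
    clean_ends = [e["end"] for e in executions if e["errors"] == 0]
    start = max(clean_ends) if clean_ends else None
    return start, now


runs = [
    {"end": datetime(2019, 10, 4, 2, 0), "errors": 0},  # Friday: clean
    {"end": datetime(2019, 10, 5, 2, 0), "errors": 3},  # Saturday: failed
    {"end": datetime(2019, 10, 6, 2, 0), "errors": 1},  # Sunday: failed
]
print(delta_window(runs, datetime(2019, 10, 7, 2, 0)))
```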

SLIDE 25

Using Neo4j in Kettle : Logging

➢ Top-to-bottom: find an error
  ○ Large jobs are hard to debug
  ○ Sub-jobs and sub-transformations obfuscate
  ○ Going through logging takes time
  ○ We know the loaded job or transformation
  ○ Neo4j can find the shortest path to the lowest execution node without children with errors>0
  ○ We can show these shortest paths to the error
  ○ The user knows in seconds where the error happened and can go straight to it to fix it.
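
The top-to-bottom search can be mimicked outside the database with a breadth-first search. The Python sketch below is an in-memory stand-in for the graph query, assuming the execution hierarchy is a tree; the `children` and `errors` dictionaries are hypothetical representations of the logging graph.

```python
from collections import deque


def path_to_error(children, errors, root):
    """Breadth-first search from the root execution to the nearest failing
    node that has no failing children. Returns the path as a list of names,
    or None if the root itself reports no errors."""
    if errors.get(root, 0) == 0:
        return None
    queue = deque([[root]])
    while queue:
        path = queue.popleft()
        failing = [c for c in children.get(path[-1], []) if errors.get(c, 0) > 0]
        if not failing:
            return path  # lowest failing node on this branch
        for child in failing:
            queue.append(path + [child])
    return None


children = {"nightly job": ["load", "report"], "load": ["read csv", "write graph"]}
errors = {"nightly job": 1, "load": 1, "write graph": 1}
print(path_to_error(children, errors, "nightly job"))
```

Only failing children are followed, so healthy sub-jobs are never visited and the first path returned is also the shortest.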

SLIDE 26
SLIDE 27
SLIDE 28
SLIDE 29

Using Neo4j in Kettle : Logging

➢ Bottom-up: how was a component executed
  ○ We know the step or job entry selected
  ○ Neo4j can find the shortest path to the root execution node without parents
  ○ We can show these execution paths
  ○ The user knows how something was executed
  ○ Very useful in highly dynamic conditional executions
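
The bottom-up direction is the mirror image: start at the selected step and walk parent links until a node without a parent is reached. A minimal Python sketch, assuming each execution node has at most one parent; the `parent` mapping is a hypothetical stand-in for the logging graph.

```python
def path_to_root(parent, node):
    """Follow parent links from a step or job entry up to the root execution.
    Returns the path ordered from root down to the starting node."""
    path = [node]
    seen = {node}
    while path[-1] in parent:
        nxt = parent[path[-1]]
        if nxt in seen:  # guard against malformed, cyclic data
            raise ValueError(f"cycle detected at {nxt!r}")
        seen.add(nxt)
        path.append(nxt)
    return list(reversed(path))


parent = {"write graph": "load", "load": "nightly job"}
print(path_to_root(parent, "write graph"))
```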

SLIDE 30
SLIDE 31

Using Neo4j in Kettle : Logging

➢ Examining executions with browser or Bloom
➢ What exactly executed what, how, when, …?
➢ We generate Cypher for Neo4j beginners
➢ Fun Neo4j learning path for Kettle users

SLIDE 32

Other data for this audit graph...

➢ Data profiling
➢ Git branches and commit history graph
➢ Transformation unit testing results
➢ Transformation data lineage information
➢ …
➢ Coming soon

SLIDE 33

Examples

SLIDE 34

Kettle: Quick Spoon intro

SLIDE 35

Loading Neo4j: loading nodes

➢ Demonstrates the Neo4j Output step
➢ Read a CSV file in parallel
➢ Load the data into nodes in parallel

SLIDE 36

Loading Neo4j: update graphs

➢ Demonstrates the Neo4j Graph Output step
➢ Updates multiple nodes and relationships at once
➢ Takes key values into account to ignore nodes
➢ Automatically generates MERGE statements

SLIDE 37

Sourcing Neo4j: simple reading

➢ Read using a Cypher query
➢ Write to an Excel file

SLIDE 38

To wrap up...

SLIDE 39

Take-aways

Data Integration for Neo4j using Kettle:
➢ Work faster, tackle harder problems
➢ Reduce risk by showing results faster
➢ Govern your Neo4j solutions using Neo4j

SLIDE 40

Upcoming

Kettle Community Meetup
➢ → kcm19.be
➢ Antwerp
➢ Saturday November 23rd

SLIDE 41

Join our slack

kettle-community.slack.com
➢ Mail me for an invite: matt.casters@neo4j.com

SLIDE 42

The Hunger Games

SLIDE 43

Hunger Games Questions for

"Data Integration for Neo4j using Kettle"

1. Easy: Can you extract information from relational databases using Kettle?

a. No
b. Yes, but only a few
c. Yes, almost all of them

2. Medium: Can I script harder parts of my data orchestration work?

a. No, Kettle is a visual programming tool
b. Yes, you can use all popular scripting languages
c. Yes, you can use JavaScript

3. Hard: Can Kettle work with big data resources?

a. Yes, Kettle has native support for protocols like S3, HDFS, GS and others.
b. Yes, a) plus Kettle also supports visual Map/Reduce development
c. Yes, b) plus Kettle also supports execution on the Spark, Flink and DataFlow engines

Answer here: r.neo4j.com/hunger-games

SLIDE 44

Kettle & Neo4j Q&A