Data Integration for Neo4j using Kettle
Matt Casters, matt.casters@neo4j.com
mattcasters Neo4j Chief Solutions Architect
Topics
➢ What is Kettle?
➢ Kettle plugins for Neo4j
➢ Kettle using Neo4j
➢ Examples
➢ The Hunger Games
➢ A visual programming tool for data orchestration
➢ A.k.a. Pentaho Data Integration, from Hitachi Vantara
➢ Over 15 years old
➢ Open source under the Apache License 2.0
➢ Large community, marketplace, ...
➢ Easy to embed, install, package, rebrand
➢ Download your Neo4j remix at www.kettle.be
➢ On tiny and enormous systems, real or virtual
➢ Very small computers, Raspberry Pi sized
➢ Your laptop or browser
➢ Locally or in the cloud
➢ On Hadoop clusters, VMs, Docker, Serverless, ...
➢ At large and small companies
➢ In government
➢ In education
➢ In the Neo4j Solutions Reference Architecture
➢ Reduce costs, reach goals faster
➢ Answers the “build or buy?” question
[Chart: accumulated cost over time, comparing “build” and “buy” with Kettle]
➢ Metadata driven, engine based:
  ○ No code generation
  ○ Define what you need to happen → GUI, Web, code, rules, …
  ○ Clear and transparent, self-documenting
➢ Types of work:
  ○ Jobs for workflows
  ○ Transformations for parallel data streaming
➢ 100% exposure of our engine through UI elements
➢ Everyone should be able to play along: plugins!
➢ We built integration points for others: run everywhere!
➢ Allow the user to avoid programming anything
➢ Allow the user to program anything: JavaScript, Java, Groovy, RegEx, Rules, Python, Ruby, R, …
➢ Transparency wins: best-in-class logging, data lineage, execution lineage, debugging / breakpoints, data previewing, row sniff testing, …
➢ SpoonGit: UI integration with git
➢ WebSpoon: web interface to the full Spoon UI
➢ Data Sets: build transformation unit tests
➢ Native file system protocols: hdfs://, s3://, gs://, …
➢ Hadoop support through a compatibility layer
➢ Kettle Beam: execute transformations on Apache Spark, Apache Flink and GCP DataFlow
➢ Spoon: GUI
➢ Scripts
➢ Server(s)
➢ Java API & SDK
➢ Standard file format
➢ Plugin ecosystem
➢ Docker image(s)
➢ Documentation, books, ...
Version control system (git)
  → Checkout a version
Deploy system (VM, Docker, ...)
  → Configure: set up, initialize, run
Artifacts, graphs, configurations
➢ For reading and writing
➢ Dynamic Cypher
➢ Batching and UNWIND
➢ Parallel execution
➢ High performance
➢ Call procedures
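The batching + UNWIND pattern above can be sketched in Python. This is an illustrative sketch, not the plugin's actual code: rows are grouped into batches and each batch is sent as one UNWIND statement with a parameter list, so a single round trip covers many rows. The `Person` label and field names are made up for the example.

```python
# Sketch of the batching + UNWIND pattern: instead of one round trip
# per row, rows are grouped into batches and sent as a single
# parameter list. Names are illustrative, not the plugin's API.

def unwind_batches(rows, batch_size=1000):
    """Yield (cypher, parameters) pairs, one per batch of rows."""
    cypher = (
        "UNWIND $rows AS row "
        "MERGE (p:Person {id: row.id}) "
        "SET p.name = row.name"
    )
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        yield cypher, {"rows": batch}

rows = [{"id": i, "name": f"person-{i}"} for i in range(2500)]
batches = list(unwind_batches(rows, batch_size=1000))
# 2500 rows at batch size 1000 -> 3 statements instead of 2500
```

Each (cypher, parameters) pair would then be executed in one transaction by a real driver; parallel execution simply shards the incoming rows over several such writers.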
➢ Easy node creation
➢ Create/Merge of ()-[]-()
➢ Batching and UNWIND
➢ Parallel execution
➢ Dynamic labels
➢ Update parts of a graph
➢ Auto-generate Cypher
➢ Using a logical model
➢ Using field mapping
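Generating Cypher from a field mapping might look roughly like this Python sketch (a hypothetical helper, not the step's real generator): the key field lands in the MERGE pattern and the remaining mapped fields become SET clauses.

```python
# Illustrative sketch of auto-generating a MERGE statement from a
# field->property mapping and a key field. Label and field names
# are example data, not plugin output.

def generate_merge(label, key_field, mapping):
    """Build a MERGE statement from a field->property mapping."""
    set_clause = ", ".join(
        f"n.{prop} = row.{field}"
        for field, prop in mapping.items()
        if field != key_field
    )
    return (
        f"UNWIND $rows AS row "
        f"MERGE (n:{label} {{{mapping[key_field]}: row.{key_field}}}) "
        f"SET {set_clause}"
    )

cypher = generate_merge(
    "Customer",
    key_field="customer_id",
    mapping={"customer_id": "id", "customer_name": "name"},
)
```

MERGE on the key property is what makes the operation idempotent: re-running the load updates existing nodes instead of duplicating them.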
➢ Job Entry
➢ Validates that databases are up
➢ Used for error diagnostics
➢ Defensive setup
➢ Job Entry
➢ Executes a series of Cypher statements
➢ Generate CSV files for Neo4j Import
➢ Generates appropriate header
➢ Handles escaping, quoting, …
➢ Outputs file names
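The header and quoting rules can be illustrated with a small Python sketch. The typed header (`id:ID`, `name:string`, `:LABEL`) follows the documented neo4j-import CSV format; the quoting (double quotes, embedded quotes doubled) is what Python's csv module produces and what the import tool accepts. Column names are example data, not the step's actual output.

```python
import csv
import io

# Sketch of what the step produces: a neo4j-import style header
# (field:type, :ID, :LABEL columns) plus properly quoted rows.

def write_import_csv(nodes):
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerow(["id:ID", "name:string", ":LABEL"])  # typed header
    for node in nodes:
        writer.writerow([node["id"], node["name"], node["label"]])
    return buf.getvalue()

out = write_import_csv(
    [{"id": 1, "name": 'Kat "Everdeen"', "label": "Person"}]
)
```

Embedded double quotes are doubled (`""Everdeen""`), which is the escaping the bulk importer expects.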
➢ Splits a graph field into nodes and relationships
➢ Used for unique value calculation
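Conceptually the split works like the following Python sketch (the data shapes are illustrative, not the step's internal representation): walk the relationships, collect each endpoint node once, and keep the relationships as-is, so unique values can be counted.

```python
# Illustrative sketch of splitting a "graph" value into its nodes and
# relationships, de-duplicating nodes for unique value calculation.

def split_graph(graph):
    nodes, seen = [], set()
    for rel in graph["relationships"]:
        for node in (rel["from"], rel["to"]):
            key = (node["label"], node["id"])
            if key not in seen:  # keep each node only once
                seen.add(key)
                nodes.append(node)
    return nodes, graph["relationships"]

g = {
    "relationships": [
        {"from": {"label": "Person", "id": 1},
         "to": {"label": "Person", "id": 2}, "type": "KNOWS"},
        {"from": {"label": "Person", "id": 2},
         "to": {"label": "Person", "id": 1}, "type": "KNOWS"},
    ]
}
nodes, rels = split_graph(g)  # 2 unique nodes, 2 relationships
```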
➢ Runs a neo4j-import command
➢ Accepts the filenames of CSV files
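Assembling the command might look like this sketch. The `--database`, `--nodes` and `--relationships` flags exist on `neo4j-admin import`, but paths, file names and the exact flag set vary by Neo4j version, so treat this as illustrative rather than the entry's real implementation.

```python
# Hedged sketch: building a neo4j-admin import command line from the
# CSV file names produced by the previous steps. File names below are
# examples, not real output.

def build_import_command(database, node_files, rel_files):
    cmd = ["neo4j-admin", "import", f"--database={database}"]
    for f in node_files:
        cmd.append(f"--nodes={f}")
    for f in rel_files:
        cmd.append(f"--relationships={f}")
    return cmd

cmd = build_import_command(
    "graph.db",
    node_files=["nodes-person.csv"],
    rel_files=["rels-knows.csv"],
)
```

A real job entry would hand this list to a process runner (e.g. `subprocess.run(cmd)`) and fail the job on a non-zero exit code.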
➢ Writes logging to Neo4j
➢ Builds an execution lineage graph
➢ Updates a metadata graph
➢ Execution details are stored at the Job, Job entry, Transformation, Step and Database levels
➢ Stores graph updates
  ○ Node creation or update
  ○ Relationship creation or update
➢ Examine past executions
  ○ See what went wrong over the weekend
  ○ Click on a step to see how long it took
  ○ Examine log texts
  ○ Generate Cypher queries to examine further
➢ Calculate a delta window
  ○ Takes the last error-free execution into account
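The delta-window calculation can be sketched as: scan the logged executions and start the new window at the most recent run that finished without errors. The data shape below is illustrative, not the actual logging schema.

```python
from datetime import datetime

# Sketch of the delta-window idea: look back through past executions
# (as logged in Neo4j) and anchor the new window on the last run
# that finished without errors.

def delta_window_start(executions):
    """Return the start of the last error-free execution, or None."""
    clean = [e for e in executions if e["errors"] == 0]
    if not clean:
        return None
    return max(clean, key=lambda e: e["start"])["start"]

runs = [
    {"start": datetime(2019, 11, 1), "errors": 0},
    {"start": datetime(2019, 11, 2), "errors": 3},  # failed run
    {"start": datetime(2019, 11, 3), "errors": 0},
]
window_start = delta_window_start(runs)
```

Anchoring on the last clean run rather than the last run means a failed weekend load is automatically re-covered by the next incremental extract.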
➢ Top-to-bottom: find an error
  ○ Large jobs are hard to debug
  ○ Sub-jobs and sub-transformations obfuscate
  ○ Going through logging takes time
  ○ We know the loaded job or transformation
  ○ Neo4j can find the shortest path to the lowest execution node without children that has errors > 0
  ○ We can show these shortest paths to the error
  ○ The user knows in seconds where the error happened and can go straight to it to fix it
➢ Bottom-up: how was a component executed?
  ○ We know the step or job entry selected
  ○ Neo4j can find the shortest path to the root execution node without parents
  ○ We can show these execution paths
  ○ The user knows how something was executed
  ○ Very useful in highly dynamic conditional executions
➢ Examining executions with Browser or Bloom
➢ What exactly executed what, how, when, …?
➢ We generate Cypher for Neo4j beginners
➢ Fun Neo4j learning path for Kettle users
Coming soon:
➢ Data profiling
➢ Git branches and commit history graph
➢ Transformation unit testing results
➢ Transformation data lineage information
➢ …
➢ Demonstrates the Neo4j Output step
➢ Read a CSV file in parallel
➢ Load the data into nodes in parallel
➢ Demonstrates the Neo4j Graph Output step
➢ Updates multiple nodes and relationships at once
➢ Takes key values into account to ignore unchanged nodes
➢ Automatically generates MERGE statements
➢ Read using a Cypher query
➢ Write to an Excel file
Data Integration for Neo4j using Kettle:
➢ Work faster, tackle harder problems
➢ Reduce risk by showing results faster
➢ Govern your Neo4j solutions using Neo4j
Kettle Community Meetup
➢ kcm19.be
➢ Antwerp
➢ Saturday, November 23rd
kettle-community.slack.com
➢ Mail me for an invite: matt.casters@neo4j.com
Questions for the Hunger Games
1. Easy: Can you extract information from relational databases using Kettle?
a. No
b. Yes, but only a few
c. Yes, almost all of them
2. Medium: Can I script harder parts of my data orchestration work?
a. No, Kettle is a visual programming tool
b. Yes, you can use all popular scripting languages
c. Yes, you can use JavaScript
3. Hard: Can Kettle work with big data resources?
a. Yes, Kettle has native support for protocols like S3, HDFS, GS and others
b. Yes: (a), plus Kettle also supports visual Map/Reduce development
c. Yes: (b), plus Kettle also supports execution on the Spark, Flink and DataFlow engines
Answer here: r.neo4j.com/hunger-games