[PPT] - Revealing Elasticsearch Implementation, Integration, and Execution PowerPoint Presentation

SLIDE 1

Revealing Elasticsearch

Implementation, Integration, and Execution

SLIDE 2

Objective: Get access to a cluster, index documents, find them, and present them.

SLIDE 3

Target audience

Web developers
Data scientists
Report developers
Technologists
Infrastructure/DevOps

SLIDE 4

What is Elasticsearch?

○ Written in Java
■ Open source
■ Cross platform
○ Based on Apache Lucene (the same library behind Apache Solr)
○ Scaled, real-time search & analytics
○ Full RESTful API
○ Plugin ecosystem
○ SDKs for Java, .NET, many more
○ Eventually consistent

SLIDE 5

An Elastic Timeline

SLIDE 6

Elasticsearch History

[Timeline figure, 2010 to 2017. Labels shown: 0.x, 1.x, 2.x, 5.x, $104M in funding, Elastic Cloud, Prelert]

SLIDE 7

Getting Started

SLIDE 8

Objective: All you need is an endpoint

http://localhost:9200/_search
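
As a minimal sketch of what "all you need is an endpoint" can look like in practice, the Python below builds the search URL and issues a GET using only the standard library. The helper names (`search_url`, `search`) are illustrative, and the base URL assumes a local, unsecured node on the default port 9200.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, default port


def search_url(base=BASE_URL, index=None):
    """Return the _search endpoint for the whole cluster or for one index."""
    base = base.rstrip("/")
    return f"{base}/{index}/_search" if index else f"{base}/_search"


def search(index=None, base=BASE_URL):
    """GET the endpoint and decode the JSON response (needs a running node)."""
    with urllib.request.urlopen(search_url(base, index)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(search_url())                  # http://localhost:9200/_search
    print(search_url(index="patients"))  # http://localhost:9200/patients/_search
```

Calling `search()` with no arguments would query every index on the node; passing an index name narrows the search to that index.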

SLIDE 9

Getting Out of the Gates

Option 1 (*nix)

○ Apt-get the latest version of Elasticsearch (5.2.1) from elastic.co
○ Run bin/elasticsearch
○ Curl http://localhost:9200

Option 2 (Windows)

○ Download the latest version of Elasticsearch from elastic.co
○ Run bin\elasticsearch.bat

Option 3 (Cloud)

○ Create a free account with Elastic.co
○ Create a free account with Amazon Web Services
○ Many other providers

SLIDE 10

Cluster Overview

SLIDE 11

Objective: Understand how data is stored and transactions are scaled

SLIDE 12

Standard Configuration

A typical production cluster will contain 3 nodes (installations)

○ Additional nodes can be brought online through discovery

A typical index will contain 5 primary shards and 5 replica shards (one replica per primary)

Data is replicated across all nodes so loss of a node will not affect the cluster

A master node is commonly specified to handle routing of requests

Data is also serialized to disk and can be recovered
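
To illustrate how documents spread across primary shards, here is a rough sketch of the routing rule Elasticsearch documents: shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. Elasticsearch uses its own Murmur3-based hash; `zlib.crc32` below is only a stand-in, so the shard numbers will not match a real cluster. The function name `pick_shard` is hypothetical.

```python
import zlib
from collections import Counter

NUM_PRIMARY_SHARDS = 5  # matches the 5 primary shards mentioned above


def pick_shard(doc_id, num_primary_shards=NUM_PRIMARY_SHARDS):
    """Map a document ID to a primary shard number in [0, num_primary_shards)."""
    # Stand-in hash: real Elasticsearch hashes the routing value with Murmur3.
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    # Count how 1,000 hypothetical document IDs spread across the shards.
    spread = Counter(pick_shard(i) for i in range(1000))
    print(sorted(spread.items()))
```

Because the shard is derived from the primary shard count, that count is fixed when the index is created; changing it would send existing IDs to different shards.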

SLIDE 13

Storage: A cluster with 3 nodes of 32GB RAM machines has 32GB of cache.

SLIDE 14

Questions?

SLIDE 15

Indexing Data

SLIDE 16

Objective: All you need is Postman

SLIDE 17

Inverted Indexes

Elasticsearch uses a structure called an inverted index

Find all the unique words that appear in document
List documents in which word (token) appears

Find all documents in which token exists

Reduces total search size
Ranks documents based on occurrences

Word stemming & casing

Cases are removed in tokens
Stemming algorithm drops “ing”, “ly”, “s”, etc.

Normalization

All inverted indexes are normalized
Custom analyzers can be applied to documents
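
The steps above can be sketched in a few lines of Python. This is a toy model, not how Lucene stores its index: it lowercases tokens but does no stemming, and the sample documents and function names are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical documents keyed by ID.
docs = {
    1: "Low back pain",
    2: "Chronic back pain after lifting",
    3: "Allergy to other foods",
}
inverted = build_inverted_index(docs)
print(inverted["back"])     # [1, 2]
print(inverted["allergy"])  # [3]
```

A search for a token becomes a dictionary lookup instead of a scan over every document, which is why the structure reduces total search size.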

SLIDE 18

Mappings

Available types:

Boolean
Long
Double
Date
String

Elasticsearch Mapping

Elasticsearch will attempt to “guess” type mappings as each document is indexed. Once created, the mapping of an existing field cannot be changed without re-creating the index, although new fields can still be added. A custom mapping can be applied before indexing documents.
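
As a hedged sketch of applying a custom mapping before indexing, the Python below builds (but does not send) the PUT request that creates an index with explicit field types. The field selection is an assumption based on the patient document used later in the deck, with fields assumed to sit at the top level of the document. The type names follow Elasticsearch 5.x, where the older "string" type is split into `text` and `keyword`.

```python
import json
import urllib.request

# Assumed mapping for a few patient fields (Elasticsearch 5.x type names).
PATIENT_MAPPING = {
    "mappings": {
        "patient": {
            "properties": {
                "first_name": {"type": "text"},
                "gender": {"type": "keyword"},
                "dob": {"type": "date", "format": "epoch_millis"},
                "age": {"type": "long"},
                "height": {"type": "double"},
                "location": {"type": "geo_point"},
            }
        }
    }
}


def create_index_request(base_url, index, body):
    """Build (but do not send) the PUT request that creates the index."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = create_index_request("http://localhost:9200", "patients", PATIENT_MAPPING)
print(req.get_method(), req.full_url)  # PUT http://localhost:9200/patients
# urllib.request.urlopen(req) would send it to a running node.
```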

SLIDE 19

Analyzers

None
Standard

○ Splits the input text on word boundaries
○ Terms are lower cased

Whitespace

○ Breaks text into terms whenever it encounters a whitespace character

Simple

○ Breaks text into terms whenever it encounters a character which is not a letter
○ Terms are lower cased

Language

○ 33+ languages supported
○ Stems words based on language
○ Removes language specific “stop” words

Custom

○ E.g. Remove “stop” words using a language filter
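
To make the differences between analyzers concrete, here is a small Python approximation of the whitespace and simple analyzers plus a stop-word filter. These functions only imitate the behavior described above; they are not the Lucene implementations, and the stop-word list is a tiny illustrative sample.

```python
import re


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are preserved."""
    return text.split()


def simple_analyzer(text):
    """Split on any non-letter character and lowercase each term."""
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


def stop_filter(terms, stop_words=frozenset({"the", "a", "an", "of", "to"})):
    """Drop stop words; this tiny English list is illustrative only."""
    return [t for t in terms if t not in stop_words]


text = "The patient's 2 X-rays"
print(whitespace_analyzer(text))           # ['The', "patient's", '2', 'X-rays']
print(simple_analyzer(text))               # ['the', 'patient', 's', 'x', 'rays']
print(stop_filter(simple_analyzer(text)))  # ['patient', 's', 'x', 'rays']
```

The same input produces quite different terms, which is why the analyzer chosen at index time determines what a later query can match.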

SLIDE 20

Patient Document Example

{ "patient": { "first_name": "John", "last_name": "Doe", "dob": 252507600000, "gender": "Male", "race": "White", "height": 1.8288, "weight": 90.7185, "eyes": "blue", "hair": "brown", "age": 39, "tobacco": "no", "location": { "lat": 40.762446, "lon": -73.831653 }, "conditions": [{ "icd10": "M54.5", "description": "Low back pain" }, { "icd10": "Z91.018", "description": "Allergy to other foods" }], "medications": [{ "name": "Aspirin", "dosage": 150, "units": "mg", "frequency": 8, "freq_units": "hours" }] } }

JSON format (JavaScript Object Notation)
Index by PUTting document to index endpoint

○ (PUT patients/patient/1)
○ Last item is unique key (1)

Index operation automatically creates an index if it has not been created before

Elasticsearch “guesses” types as they are posted

Each indexed document is given a version number

Index API optionally allows for optimistic concurrency control when the version parameter is specified

Bulk-indexing supported (Bulk API)
River plugins (Oracle, MSSQL, MySQL)
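
As a sketch of the two indexing styles mentioned above, the Python below builds the path for a single-document PUT (optionally with the version parameter) and the newline-delimited body the Bulk API expects. The helper names are hypothetical, the second patient is invented for the example, and nothing is sent to a cluster.

```python
import json


def index_path(index, doc_type, doc_id, version=None):
    """Path for PUT <index>/<type>/<id>, optionally with ?version=N."""
    path = f"/{index}/{doc_type}/{doc_id}"
    return f"{path}?version={version}" if version is not None else path


def bulk_body(index, doc_type, docs):
    """Build the newline-delimited JSON body the Bulk API expects.

    Each document contributes an action line followed by a source line,
    and the body must end with a newline.
    """
    lines = []
    for doc_id, source in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


patients = {
    "1": {"first_name": "John", "last_name": "Doe", "age": 39},
    "2": {"first_name": "Jane", "last_name": "Roe", "age": 41},  # hypothetical
}
print(index_path("patients", "patient", 1))             # /patients/patient/1
print(index_path("patients", "patient", 1, version=2))  # /patients/patient/1?version=2
print(bulk_body("patients", "patient", patients), end="")
```

The bulk body would be POSTed to the `_bulk` endpoint; sending many documents in one request avoids a round trip per document.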

SLIDE 21

Questions?

SLIDE 22

Querying Documents

SLIDE 23

Objective: All you need is JSON

SLIDE 24

Query DSL (Domain-Specific Language)

Leaf query clauses

○ Leaf query clauses look for a particular value in a particular field, such as the match, term or range queries. These queries can be used by themselves.

Compound query clauses

○ Compound query clauses wrap other leaf or compound queries and are used to combine multiple queries in a logical fashion (such as the bool or dis_max query), or to alter their behavior (such as the constant_score query).

SLIDE 25

Common Query Types

Full Text

○ Match All
○ Query String

Term

○ Term
○ Range
○ Exists
○ Regexp
○ Fuzzy

Compound

○ Bool
○ Boosting

Joining

○ Nested

Geo

○ Geo Shape
○ Geo Distance
○ Geo Polygon

Specialized

○ More Like This
○ Template
○ Script

SLIDE 26

Sample Bool Query

{ "query": { "bool": { "must": [{ "match": { "medications.name": "Aspirin" } }], "filter": [{ "term": { "gender": "Male" } }, { "range": { "age": { "lte": 50, "gte": 30 } } }] } } }

JSON format (JavaScript Object Notation)

Search by performing GET against a specific index

○ GET /patients/_search

This query returns all men between the ages of 30 and 50 who use aspirin
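
The same bool query can be assembled programmatically. The sketch below uses small, hypothetical helper functions (they are not part of any Elasticsearch client) to build leaf clauses and wrap them in a compound bool clause, producing a dictionary equivalent to the JSON above.

```python
import json


def match(field, text):
    """Leaf clause: full-text match on one field."""
    return {"match": {field: text}}


def term(field, value):
    """Leaf clause: exact-value match on one field."""
    return {"term": {field: value}}


def range_(field, gte=None, lte=None):
    """Leaf clause: inclusive range; bounds left as None are omitted."""
    bounds = {k: v for k, v in (("gte", gte), ("lte", lte)) if v is not None}
    return {"range": {field: bounds}}


def bool_query(must=(), filters=()):
    """Compound clause: scored `must` clauses plus unscored `filter` clauses."""
    return {"query": {"bool": {"must": list(must), "filter": list(filters)}}}


query = bool_query(
    must=[match("medications.name", "Aspirin")],
    filters=[term("gender", "Male"), range_("age", gte=30, lte=50)],
)
print(json.dumps(query, indent=2))
```

Clauses under `must` contribute to the relevance score, while clauses under `filter` only include or exclude documents.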

SLIDE 27

Query Result

{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.3862944, "hits": [{ "_index": "patients", "_type": "patient", "_id": "1", "_score": 1.3862944, "_source": { "first_name": "John", "last_name": "Doe", "dob": 252507600000, . . . } }] } }

Query returns a formatted JSON result indicating the search metrics

Took

○ Length of time in milliseconds the query took to execute and return

Shards

○ Number of shards utilized in execution of the query

Hits

○ Total and max score of all results
○ Hits[] is an array of resulting documents, which can be limited by size
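
A short sketch of reading these fields in Python follows. It assumes the 5.x response shape shown above, where `hits.total` is a plain integer; the `summarize` helper is hypothetical and the sample is a trimmed copy of the response with only two `_source` fields kept.

```python
def summarize(response):
    """Pull the headline metrics and the documents out of a search response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "shards_failed": response["_shards"]["failed"],
        "total": hits["total"],
        "max_score": hits["max_score"],
        "documents": [(h["_id"], h["_score"], h["_source"]) for h in hits["hits"]],
    }


# Trimmed copy of the response above.
sample = {
    "took": 1,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 1,
        "max_score": 1.3862944,
        "hits": [{
            "_index": "patients", "_type": "patient", "_id": "1",
            "_score": 1.3862944,
            "_source": {"first_name": "John", "last_name": "Doe"},
        }],
    },
}
summary = summarize(sample)
print(summary["total"], summary["documents"][0][0])  # 1 1
```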

SLIDE 28

Aggregates

{ "query": { "bool": { "must": [{ "match": { "gender": "Male" } }] } }, "aggs": { "medications": { "terms": { "field": "medications.name" } } } }

An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents.

Bucketing

○ A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion.

Metric

○ Aggregations that keep track and compute metrics over a set of documents.

Matrix

○ A family of aggregations that operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields.

Pipeline

○ Aggregations that aggregate the output of other aggregations and their associated metrics.

SLIDE 29

Query Result

{ . . . "aggregations" : { "medications" : { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets" : [ { "key" : "Aspirin", "doc_count" : 2465 }, { "key" : "Omeprazole", "doc_count" : 1824 }, { "key" : "Lisinopril", "doc_count" : 1121 } ] } } }

A bucket aggregation finds all documents matching the query (in this case all males) and aggregates the results into key and doc_count fields. Only documents matching the initial query will be considered for aggregation.
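
The bucketing idea can be emulated locally to show what the cluster computes. The sketch below is a toy model over in-memory dictionaries, not the Elasticsearch implementation; the function name and the sample patients are hypothetical.

```python
from collections import Counter


def terms_aggregation(docs, matches, field_values, size=10):
    """Emulate a terms aggregation: filter docs, then count values per key."""
    counts = Counter()
    for doc in docs:
        if matches(doc):
            # Count each distinct value once per document, like doc_count.
            counts.update(set(field_values(doc)))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": k, "doc_count": n} for k, n in ordered[:size]]


# Hypothetical in-memory patients, not data from the deck.
docs = [
    {"gender": "Male", "medications": [{"name": "Aspirin"}]},
    {"gender": "Male", "medications": [{"name": "Aspirin"}, {"name": "Lisinopril"}]},
    {"gender": "Female", "medications": [{"name": "Aspirin"}]},
]
buckets = terms_aggregation(
    docs,
    matches=lambda d: d["gender"] == "Male",
    field_values=lambda d: [m["name"] for m in d["medications"]],
)
print(buckets)
# [{'key': 'Aspirin', 'doc_count': 2}, {'key': 'Lisinopril', 'doc_count': 1}]
```

The female patient's Aspirin entry is not counted because she does not match the query, mirroring the rule that only matching documents are aggregated.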

SLIDE 30

Statistical Aggregates

{ "query": { "bool": { "must": [{ "match": { "gender": "Male" } }] } }, "aggs": { "age_stats": { "extended_stats": { "field": "age" } } } }

The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.

SLIDE 31

Query Result

{ "aggregations": { "age_stats": { "count": 10000, "min": 1, "max": 104, "avg": 64, "sum": 156478, "sum_of_squares": 467028, "variance": 51.55555555555556, "std_deviation": 7.180219742846005, "std_deviation_bounds": { "upper": 100.36043948569201, "lower": 71.63956051430799 } } } }

By default, the extended_stats metric will return an object called std_deviation_bounds, which provides an interval of plus/minus two standard deviations from the mean. This can be a useful way to visualize variance of your data. If you want a different boundary, for example three standard deviations, you can set sigma in the request, e.g. “sigma”: 3. Your data must be normally distributed for the metrics to make sense. The statistics behind standard deviations assume normally distributed data, so if your data is skewed heavily left or right, the value returned will be misleading.
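
The arithmetic behind these fields can be reproduced locally. The sketch below computes a subset of the extended_stats output using population variance, with `sigma` controlling the width of the bounds. The function is a hypothetical stand-in and the input values are invented for the example; they are not the deck's data.

```python
import math


def extended_stats(values, sigma=2.0):
    """Compute a subset of the extended_stats fields for a list of numbers."""
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance: mean of squares minus square of the mean,
    # clamped at zero to absorb floating-point rounding.
    variance = max(sum_of_squares / count - avg * avg, 0.0)
    std = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std,
        "std_deviation_bounds": {
            "upper": avg + sigma * std,
            "lower": avg - sigma * std,
        },
    }


ages = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
stats = extended_stats(ages)
print(stats["avg"], stats["std_deviation"])  # 5.0 2.0
print(stats["std_deviation_bounds"])         # {'upper': 9.0, 'lower': 1.0}
```

Passing `sigma=3` widens the interval to three standard deviations on either side of the mean, matching the effect of setting sigma in the request.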

SLIDE 32

Questions?

SLIDE 33

Use Cases

SLIDE 34

The Big Players

eBay

Searching across 800 million listings in subseconds

USAA

Securing USAA's entire internal network and application portfolio

Thomson Reuters

Driving better research, analysis, and journalism

Stack Exchange

Powering their search experience across anything from core to careers code.

SLIDE 35

Sample Applications

Geographic Regions

Patients within geographic areas (map API)
Demographic breakdowns by geographic areas

Historical Trends

Date histograms (HighCharts)
Extended statistical queries

Real-Time Search

Fast enough for type-ahead
Fuzzy matching and “more like this”

Map/Reduce

Apache Hadoop plug-in
Custom script filters & calculated fields

SLIDE 36

SLIDE 37

SLIDE 38

SLIDE 39

Thank you!