Revealing Elasticsearch: Implementation, Integration, and Execution

slide-1
SLIDE 1

Revealing Elasticsearch

Implementation, Integration, and Execution

slide-2
SLIDE 2

Objective: Get access to a cluster, index documents, find them, and present them.

slide-3
SLIDE 3

Target audience

  • Web developers
  • Data scientists
  • Report developers
  • Technologists
  • Infrastructure/DevOps
slide-4
SLIDE 4

What is Elasticsearch?

○ Written in Java
  ■ Open source
  ■ Cross platform
○ Based on Apache Lucene (the same library that underlies Apache Solr)
○ Scalable, real-time search & analytics
○ Full RESTful API
○ Plugin ecosystem
○ SDKs for Java, .NET, many more
○ Eventually consistent

slide-5
SLIDE 5

An Elastic Timeline

slide-6
SLIDE 6

Elasticsearch History

[Timeline figure covering 2010 to 2017. Milestones shown: 0.x, 1.x, $104M in funding, 2.x, Elastic Cloud, Prelert, 5.x]

slide-7
SLIDE 7

Getting Started

slide-8
SLIDE 8

Objective: All you need is an endpoint

http://localhost:9200/_search
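A minimal sketch of talking to that endpoint from Python, using only the standard library. The helper names `search_url` and `search` are illustrative, not part of any Elasticsearch client, and `search` only works if a cluster is actually reachable at the given address.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # endpoint from the slide; adjust for your cluster


def search_url(base_url=BASE_URL, index=None):
    """Build the _search URL, optionally scoped to one index."""
    parts = [base_url.rstrip("/")]
    if index:
        parts.append(index)
    parts.append("_search")
    return "/".join(parts)


def search(url, body=None, timeout=5):
    """Send a search request and return the decoded JSON response.

    Needs a reachable cluster; raises urllib.error.URLError otherwise.
    With a body the request is sent as POST, without one as GET.
    """
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(search_url())                  # http://localhost:9200/_search
    print(search_url(index="patients"))  # http://localhost:9200/patients/_search
```

Calling `search(search_url())` with no body asks the cluster for its default first page of documents across all indices.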

slide-9
SLIDE 9

Getting Out of the Gates

Option 1 (*nix)

  • Apt-get the latest version of Elasticsearch (5.2.1) from elastic.co
  • Run bin/elasticsearch
  • Curl http://localhost:9200

Option 2 (Windows)

  • Download the latest version of Elasticsearch from elastic.co
  • Run bin\elasticsearch.bat

Option 3 (Cloud)

  • Create a free account with Elastic.co
  • Create a free account with Amazon Web Services
  • Many other providers

slide-10
SLIDE 10

Cluster Overview

slide-11
SLIDE 11

Objective: Understand how data is stored and transactions are scaled

slide-12
SLIDE 12

Standard Configuration

  • A typical production cluster will contain 3 nodes (installations)
    ○ Additional nodes can be brought online through discovery
  • By default in 5.x, each index is split into 5 primary shards, each with 1 replica (5 replica shards), spread across the nodes
  • A replica is never stored on the same node as its primary, so the loss of a single node does not lose data
  • A master node is elected to manage cluster-wide state (index creation, shard allocation); any node can accept and route requests
  • Data is also serialized to disk and can be recovered
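Shard counts are set per index when the index is created. The sketch below builds the settings body you would PUT to an index endpoint and does the shard arithmetic for a 3-node cluster. `number_of_shards` and `number_of_replicas` are the real setting names; the function names are illustrative, and the even spread across nodes is an assumption.

```python
import json


def index_settings(primary_shards=5, replicas_per_primary=1):
    """Build the request body for creating an index with explicit shard counts."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_primary,
        }
    }


def total_shard_copies(primary_shards, replicas_per_primary):
    """Primaries plus every replica copy."""
    return primary_shards * (1 + replicas_per_primary)


def average_copies_per_node(primary_shards, replicas_per_primary, nodes):
    """Average shard copies per node, assuming the cluster spreads them evenly."""
    return total_shard_copies(primary_shards, replicas_per_primary) / nodes


if __name__ == "__main__":
    print(json.dumps(index_settings(), indent=2))
    print(total_shard_copies(5, 1))  # 10
    print(round(average_copies_per_node(5, 1, 3), 2))
```

The number of primary shards is fixed once the index exists, while the replica count can be changed later, so it is worth choosing the first value deliberately.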

slide-13
SLIDE 13

Storage: A cluster with 3 nodes of 32GB RAM machines has 32GB of cache.

slide-14
SLIDE 14

Questions?

slide-15
SLIDE 15

Indexing Data

slide-16
SLIDE 16

Objective: All you need is Postman

slide-17
SLIDE 17

Inverted Indexes

Elasticsearch uses a structure called an inverted index

  • Find all the unique words (tokens) that appear in each document
  • For each token, list the documents in which it appears

Find all documents in which token exists

  • Reduces total search size
  • Ranks documents based on occurrences

Word stemming & casing

  • Tokens are lowercased
  • A stemming algorithm strips suffixes such as “ing”, “ly”, and “s”

Normalization

  • All inverted indexes are normalized
  • Custom analyzers can be applied to documents
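The idea can be sketched in a few lines of Python. This is a toy model, not how Lucene is implemented: the stemmer only strips the three suffixes named above, and ranking by raw occurrence count stands in for real relevance scoring.

```python
import re
from collections import defaultdict

SUFFIXES = ("ing", "ly", "s")  # the suffixes named on the slide


def stem(token):
    """Strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split on non-letters, and stem each token."""
    return [stem(token) for token in re.findall(r"[a-z]+", text.lower())]


def build_inverted_index(documents):
    """Map each token to {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for token in analyze(text):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


def search(index, word):
    """Return matching doc ids, most occurrences first."""
    postings = index.get(stem(word.lower()), {})
    return sorted(postings, key=lambda doc_id: (-postings[doc_id], doc_id))


docs = {
    1: "Low back pain",
    2: "Back pains and recurring back pain",
    3: "Allergy to other foods",
}
index = build_inverted_index(docs)
print(search(index, "Pains"))  # [2, 1]
print(search(index, "food"))   # [3]
```

Because the query word goes through the same lowercasing and stemming as the documents, "Pains" finds documents that only contain "pain".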
slide-18
SLIDE 18

Mappings

Available types:

  • Boolean
  • Long
  • Double
  • Date
  • String (split into text and keyword in 5.x)

Elasticsearch Mapping

Elasticsearch will attempt to “guess” type mappings as each document is indexed. Once a field has been mapped, its type cannot be changed without re-creating the index and reindexing the data; new fields can still be added. A custom mapping can be applied before indexing documents.
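A custom mapping is just a JSON body sent when the index is created. The sketch below is one possible mapping for a patient record, written with 5.x type names. The field list and the choice of `keyword` for exact-match fields are assumptions for illustration, and it assumes fields sit at the top level of the document.

```python
import json

# Hypothetical mapping for a "patient" type in a "patients" index.
# "keyword" suits exact-match filters and terms aggregations;
# "text" suits analyzed full-text search.
PATIENT_MAPPING = {
    "mappings": {
        "patient": {
            "properties": {
                "first_name": {"type": "text"},
                "last_name": {"type": "text"},
                "dob": {"type": "date", "format": "epoch_millis"},
                "gender": {"type": "keyword"},
                "height": {"type": "double"},
                "weight": {"type": "double"},
                "age": {"type": "long"},
                "location": {"type": "geo_point"},
                "medications": {
                    "properties": {
                        "name": {"type": "keyword"},
                        "dosage": {"type": "long"},
                    }
                },
            }
        }
    }
}

if __name__ == "__main__":
    # Body for a request such as: PUT patients
    print(json.dumps(PATIENT_MAPPING, indent=2))
```

Sending this body before the first document is indexed prevents Elasticsearch from guessing, for example, that `dob` is a plain long rather than a date.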

slide-19
SLIDE 19

Analyzers

  • None
  • Standard
    ○ Splits the input text on word boundaries
    ○ Terms are lowercased
  • Whitespace
    ○ Breaks text into terms whenever it encounters a whitespace character
  • Simple
    ○ Breaks text into terms whenever it encounters a character which is not a letter
    ○ Terms are lowercased
  • Language
    ○ 33+ languages supported
    ○ Stems words based on language
    ○ Removes language-specific “stop” words
  • Custom
    ○ E.g. remove “stop” words using a language filter
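To make the differences concrete, here is a rough Python imitation of the whitespace and simple analyzers plus a custom variant that removes stop words. These functions only approximate the behavior described above; the stop-word list is a tiny made-up sample, not the list Elasticsearch ships.

```python
import re

STOP_WORDS = {"a", "an", "and", "of", "the", "to"}  # tiny illustrative list


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are preserved."""
    return text.split()


def simple_analyzer(text):
    """Split on any non-letter character and lowercase the terms."""
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


def custom_analyzer(text, stop_words=STOP_WORDS):
    """Simple analysis followed by stop-word removal."""
    return [term for term in simple_analyzer(text) if term not in stop_words]


sample = "Allergy to the Z91.018 foods"
print(whitespace_analyzer(sample))  # ['Allergy', 'to', 'the', 'Z91.018', 'foods']
print(simple_analyzer(sample))      # ['allergy', 'to', 'the', 'z', 'foods']
print(custom_analyzer(sample))      # ['allergy', 'z', 'foods']
```

Note how the simple analyzer mangles the code "Z91.018" into "z": fields holding identifiers are usually better left unanalyzed.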

slide-20
SLIDE 20

Patient Document Example

{
  "patient": {
    "first_name": "John",
    "last_name": "Doe",
    "dob": 252507600000,
    "gender": "Male",
    "race": "White",
    "height": 1.8288,
    "weight": 90.7185,
    "eyes": "blue",
    "hair": "brown",
    "age": 39,
    "tobacco": "no",
    "location": { "lat": 40.762446, "lon": -73.831653 },
    "conditions": [
      { "icd10": "M54.5", "description": "Low back pain" },
      { "icd10": "Z91.018", "description": "Allergy to other foods" }
    ],
    "medications": [
      { "name": "Aspirin", "dosage": 150, "units": "mg", "frequency": 8, "freq_units": "hours" }
    ]
  }
}

  • JSON format (JavaScript Object Notation)
  • Index by PUTting the document to the index endpoint
    ○ PUT patients/patient/1
    ○ The last path segment is the unique key (1)
  • The index operation automatically creates the index if it does not already exist
  • Elasticsearch “guesses” field types as documents are posted
  • Each indexed document is given a version number
  • The Index API optionally allows for optimistic concurrency control when the version parameter is specified
  • Bulk indexing is supported (Bulk API)
  • River plugins (Oracle, MSSQL, MySQL) filled this role in 1.x but were deprecated in 1.5 and removed in 2.0
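The two indexing paths above can be sketched as request builders. `index_url` produces the single-document endpoint, optionally with the `version` parameter, and `bulk_body` produces the newline-delimited body the Bulk API expects. Both function names are illustrative, and the code only builds strings; it does not send anything.

```python
import json


def index_url(base_url, index, doc_type, doc_id, version=None):
    """URL for PUTting one document; version enables the concurrency check."""
    url = f"{base_url.rstrip('/')}/{index}/{doc_type}/{doc_id}"
    if version is not None:
        url += f"?version={version}"
    return url


def bulk_body(index, doc_type, documents):
    """Build the newline-delimited JSON body for POST /_bulk.

    documents maps each document id to its source dict. Every document
    becomes two lines: an action line, then the source line.
    """
    lines = []
    for doc_id, source in documents.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline


if __name__ == "__main__":
    print(index_url("http://localhost:9200", "patients", "patient", 1))
    print(index_url("http://localhost:9200", "patients", "patient", 1, version=2))
    print(bulk_body("patients", "patient", {1: {"first_name": "John"}}), end="")
```

With `version=2`, the write is rejected if the stored document is no longer at version 2, which is how concurrent editors avoid overwriting each other.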
slide-21
SLIDE 21

Questions?

slide-22
SLIDE 22

Querying Documents

slide-23
SLIDE 23

Objective: All you need is JSON

slide-24
SLIDE 24

Query DSL (Domain-Specific Language)

  • Leaf query clauses
    ○ Leaf query clauses look for a particular value in a particular field, such as the match, term or range queries. These queries can be used by themselves.
  • Compound query clauses
    ○ Compound query clauses wrap other leaf or compound queries and are used to combine multiple queries in a logical fashion (such as the bool or dis_max query), or to alter their behavior (such as the constant_score query).
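Since the DSL is plain JSON, leaf and compound clauses compose naturally as nested dictionaries. The helpers below are a hypothetical convenience layer, not part of any client library: each leaf helper returns one clause, and `bool_` wraps a list of them.

```python
import json


def match(field, text):
    """Leaf clause: analyzed full-text match on one field."""
    return {"match": {field: text}}


def term(field, value):
    """Leaf clause: exact value match on one field."""
    return {"term": {field: value}}


def range_(field, **bounds):
    """Leaf clause: bounds such as gte=30, lte=50."""
    return {"range": {field: bounds}}


def bool_(must=(), filter_=()):
    """Compound clause: combine leaf or compound clauses."""
    clause = {}
    if must:
        clause["must"] = list(must)
    if filter_:
        clause["filter"] = list(filter_)
    return {"bool": clause}


# Men aged 30 to 50 who take aspirin, built from leaf clauses.
request_body = {
    "query": bool_(
        must=[match("medications.name", "Aspirin")],
        filter_=[term("gender", "Male"), range_("age", gte=30, lte=50)],
    )
}

if __name__ == "__main__":
    print(json.dumps(request_body, indent=2))
```

Clauses under `must` contribute to the relevance score, while clauses under `filter` only include or exclude documents.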

slide-25
SLIDE 25

Common Query Types

  • Full Text
    ○ Match All
    ○ Query String
  • Term
    ○ Term
    ○ Range
    ○ Exists
    ○ Regexp
    ○ Fuzzy
  • Compound
    ○ Bool
    ○ Boosting
  • Joining
    ○ Nested
  • Geo
    ○ Geo Shape
    ○ Geo Distance
    ○ Geo Polygon
  • Specialized
    ○ More Like This
    ○ Template
    ○ Script

slide-26
SLIDE 26

Sample Bool Query

{
  "query": {
    "bool": {
      "must": [
        { "match": { "medications.name": "Aspirin" } }
      ],
      "filter": [
        { "term": { "gender": "Male" } },
        { "range": { "age": { "lte": 50, "gte": 30 } } }
      ]
    }
  }
}

  • JSON format (JavaScript Object Notation)
  • Search by performing a GET against a specific index
    ○ GET patients/_search
  • This query returns all men between the ages of 30 and 50 who use aspirin

slide-27
SLIDE 27

Query Result

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1.3862944,
    "hits": [
      {
        "_index": "patients",
        "_type": "patient",
        "_id": "1",
        "_score": 1.3862944,
        "_source": {
          "first_name": "John",
          "last_name": "Doe",
          "dob": 252507600000,
          . . .
        }
      }
    ]
  }
}

Query returns a formatted JSON result indicating the search metrics

  • Took
    ○ Length of time in milliseconds the query took to execute and return
  • Shards
    ○ Number of shards utilized in execution of the query
  • Hits
    ○ Total and max score of all results
    ○ Hits[] is an array of resulting documents, which can be limited by size
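Reading those fields out of a response is ordinary dictionary access. The sketch below uses a response shaped like the one on the slide, keeping only the `_source` fields that were shown. `summarize` is an illustrative helper, and it assumes `hits.total` is a plain integer, as it is in 5.x.

```python
# Sample response shaped like the slide's result (elided fields left out).
SAMPLE_RESPONSE = {
    "took": 1,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 1,
        "max_score": 1.3862944,
        "hits": [
            {
                "_index": "patients",
                "_type": "patient",
                "_id": "1",
                "_score": 1.3862944,
                "_source": {
                    "first_name": "John",
                    "last_name": "Doe",
                    "dob": 252507600000,
                },
            }
        ],
    },
}


def summarize(response):
    """Pull the search metrics and the documents out of a search response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "shards_failed": response["_shards"]["failed"],
        "total": hits["total"],
        "max_score": hits["max_score"],
        "documents": [
            (hit["_id"], hit["_score"], hit["_source"]) for hit in hits["hits"]
        ],
    }


summary = summarize(SAMPLE_RESPONSE)
print(summary["took_ms"], summary["total"])  # 1 1
for doc_id, score, source in summary["documents"]:
    print(doc_id, score, source["first_name"], source["last_name"])
```

Checking `shards_failed` is worthwhile in practice: a response can report success while some shards failed and contributed no hits.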

slide-28
SLIDE 28

Aggregates

{
  "query": {
    "bool": {
      "must": [{ "match": { "gender": "Male" } }]
    }
  },
  "aggs": {
    "medications": {
      "terms": { "field": "medications.name" }
    }
  }
}

An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents.

  • Bucketing
    ○ A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion.
  • Metric
    ○ Aggregations that keep track and compute metrics over a set of documents.
  • Matrix
    ○ A family of aggregations that operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields.
  • Pipeline
    ○ Aggregations that aggregate the output of other aggregations and their associated metrics.

slide-29
SLIDE 29

Query Result

{
  . . .
  "aggregations": {
    "medications": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "Aspirin", "doc_count": 2465 },
        { "key": "Omeprazole", "doc_count": 1824 },
        { "key": "Lisinopril", "doc_count": 1121 }
      ]
    }
  }
}

A bucket aggregation finds all documents matching the query (in this case all males) and aggregates the results into key and doc_count fields. Only documents matching the initial query will be considered for aggregation.
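A short sketch of both halves of that exchange: building a terms aggregation request, with `aggs` as a sibling of `query` at the top level of the body, and flattening the returned buckets into a dictionary. The helper names are illustrative, and the sample response reuses the bucket counts shown above.

```python
def terms_agg_request(query, name, field, size=10):
    """Search body with a terms aggregation alongside the query."""
    return {
        "query": query,
        "aggs": {name: {"terms": {"field": field, "size": size}}},
    }


def bucket_counts(response, name):
    """Flatten a terms aggregation result into {key: doc_count}."""
    buckets = response["aggregations"][name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


request_body = terms_agg_request(
    {"bool": {"must": [{"match": {"gender": "Male"}}]}},
    name="medications",
    field="medications.name",
)

# Aggregation part of a response, using the counts from the slide.
sample_response = {
    "aggregations": {
        "medications": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "Aspirin", "doc_count": 2465},
                {"key": "Omeprazole", "doc_count": 1824},
                {"key": "Lisinopril", "doc_count": 1121},
            ],
        }
    }
}

counts = bucket_counts(sample_response, "medications")
print(counts["Aspirin"])      # 2465
print(sum(counts.values()))   # 5410
```

The name passed as `name` is arbitrary; it is simply echoed back as the key under `aggregations` so the caller can find its result.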

slide-30
SLIDE 30

Statistical Aggregates

{
  "query": {
    "bool": {
      "must": [{ "match": { "gender": "Male" } }]
    }
  },
  "aggs": {
    "age_stats": {
      "extended_stats": { "field": "age" }
    }
  }
}

The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.

slide-31
SLIDE 31

Query Result

{
  "aggregations": {
    "age_stats": {
      "count": 10000,
      "min": 1,
      "max": 104,
      "avg": 64,
      "sum": 156478,
      "sum_of_squares": 467028,
      "variance": 51.55555555555556,
      "std_deviation": 7.180219742846005,
      "std_deviation_bounds": {
        "upper": 100.36043948569201,
        "lower": 71.63956051430799
      }
    }
  }
}

By default, the extended_stats metric will return an object called std_deviation_bounds, which provides an interval of plus/minus two standard deviations from the mean. This can be a useful way to visualize variance of your data. If you want a different boundary, for example three standard deviations, you can set sigma in the request, e.g. "sigma": 3. Your data must be normally distributed for the metrics to make sense. The statistics behind standard deviations assume normally distributed data, so if your data is skewed heavily left or right, the value returned will be misleading.
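The arithmetic behind those fields is easy to reproduce locally. The sketch below computes the same set of statistics for a small list of made-up values, assuming the variance is the population variance (sum of squares over count, minus the squared mean). It is a way to sanity-check a response, not the server's implementation.

```python
import math


def extended_stats(values, sigma=2.0):
    """Compute extended_stats-style metrics for a non-empty list of numbers."""
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(value * value for value in values)
    avg = total / count
    # Population variance; clamp tiny negative results from rounding.
    variance = max(sum_of_squares / count - avg * avg, 0.0)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


ages = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5 and deviation 2
stats = extended_stats(ages)
print(stats["avg"], stats["std_deviation"])   # 5.0 2.0
print(stats["std_deviation_bounds"])          # {'upper': 9.0, 'lower': 1.0}
print(extended_stats(ages, sigma=3)["std_deviation_bounds"])  # {'upper': 11.0, 'lower': -1.0}
```

A quick consistency check falls out of this: the midpoint of the two bounds should equal `avg`, and `sum / count` should equal `avg` as well.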

slide-32
SLIDE 32

Questions?

slide-33
SLIDE 33

Use Cases

slide-34
SLIDE 34

The Big Players

eBay

Searching across 800 million listings in under a second

USAA

Securing USAA's entire internal network and application portfolio

Thomson Reuters

Driving better research, analysis, and journalism

Stack Exchange

Powering their search experience across anything from core to careers code.

slide-35
SLIDE 35

Sample Applications

Geographic Regions

  • Patients within geographic areas (map API)
  • Demographic breakdowns by geographic areas

Historical Trends

  • Date histograms (HighCharts)
  • Extended statistical queries

Real-Time Search

  • Fast enough for type-ahead
  • Fuzzy matching and “more like this”

Map/Reduce

  • Apache Hadoop plug-in
  • Custom script filters & calculated fields
slide-39
SLIDE 39

Thank you!