Revealing Elasticsearch
Implementation, Integration, and Execution
Objective: Get access to a cluster, index documents, find them, and present them.
Target audience: Web developers, data scientists, report developers
○ Written in Java
■ Open source
■ Cross platform
○ Based on Apache Lucene
○ Scaled, real-time search & analytics
○ Full RESTful API
○ Plugin ecosystem
○ SDKs for Java, .NET, many more
○ Eventually consistent
[Timeline figure, 2010 to 2017. Recoverable labels: versions 0.x, 1.x, 2.x, 5.x; $104M in funding; Elastic Cloud; Prelert]
Option 1 (*nix)
○ Use apt-get to install the latest version of Elasticsearch (5.2.1) from elastic.co
○ Run bin/elasticsearch
○ Run curl http://localhost:9200 to check that the node responds
Option 2 (Windows)
○ Download the latest version of Elasticsearch from elastic.co
○ Run bin\elasticsearch.bat
Option 3 (Cloud)
○ Create a free account with Elastic.co
○ Create a free account with Amazon Web Services
○ Many other providers
nodes (installations) ○ Additional nodes can be brought
shards and 5 replica shards
handle routing of requests
recovered
Elasticsearch uses a structure called an inverted index
Find all documents in which token exists
Word stemming & casing
Normalization
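The inverted index idea can be illustrated with a minimal Python sketch. It is a toy model, not how Lucene stores its index: normalization here is limited to lowercasing and splitting on non-alphanumeric characters (no stemming), and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase the text and split it into letter/digit tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in normalize(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical sample documents, not taken from the slides.
docs = {
    1: "Low back pain",
    2: "Chronic pain in the lower back",
    3: "Allergy to other foods",
}

index = build_inverted_index(docs)

# Finding all documents in which a token exists is a single dictionary lookup.
print(index.get("pain", []))
print(index.get("allergy", []))
```

Because there is no stemming in this sketch, "low" and "lower" end up as separate tokens; a language-aware analyzer would typically bring such variants closer together.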
Elasticsearch Mapping
Elasticsearch will attempt to “guess” type mappings as each document is indexed. Once created, mappings cannot be changed without re-creating the index. A custom mapping can be applied before indexing documents.
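As a rough illustration of a custom mapping, the Python sketch below builds a request body that could be sent when creating the patients index (for example with PUT /patients) before any documents are indexed. The field type choices are assumptions made for this example, and the type-level layout under "patient" follows the 5.x conventions used in this deck.

```python
import json


def patient_mapping():
    """Build an index-creation body with an explicit mapping (5.x style)."""
    return {
        "mappings": {
            "patient": {
                "properties": {
                    # Full-text fields are analyzed into terms.
                    "first_name": {"type": "text"},
                    "last_name": {"type": "text"},
                    # Milliseconds since the epoch.
                    "dob": {"type": "date", "format": "epoch_millis"},
                    # Exact-value fields are stored as-is for filters and aggregations.
                    "gender": {"type": "keyword"},
                    "age": {"type": "integer"},
                    "location": {"type": "geo_point"},
                    "medications": {
                        "properties": {
                            "name": {"type": "keyword"},
                            "dosage": {"type": "float"},
                        }
                    },
                }
            }
        }
    }


body = json.dumps(patient_mapping(), indent=2)
print(body)
```

Mapping gender and medications.name as keyword is a design choice for this sketch: it keeps the original casing, which matters for exact-match term filters and for terms aggregations.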
○ Splits the input text on word boundaries ○ Terms are lower cased
○ Breaks text into terms whenever it encounters a whitespace character
○ Breaks text into terms whenever it encounters a character which is not a letter ○ Terms are lower cased
○ 33+ languages supported ○ Stems words based on language ○ Removes language specific “stop” words
○ E.g. Remove “stop” words using a language filter
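The tokenizing behaviors described above can be approximated in a few lines of Python. This is only a sketch of the ideas (splitting on whitespace, splitting on non-letters with lowercasing, and stop word removal); the function names and the tiny stop word list are made up for illustration and do not reproduce the built-in analyzers exactly.

```python
import re

# Tiny illustrative stop word list; real language analyzers use larger ones.
ENGLISH_STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in"}


def whitespace_tokens(text):
    """Break text into terms at whitespace, leaving case and punctuation alone."""
    return text.split()


def letter_tokens(text):
    """Break text at any character that is not a letter, then lowercase."""
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


def remove_stop_words(tokens, stop_words=ENGLISH_STOP_WORDS):
    """Drop tokens that appear in the stop word list."""
    return [term for term in tokens if term not in stop_words]


text = "Allergy to Other Foods, 2 cases"
print(whitespace_tokens(text))
print(letter_tokens(text))
print(remove_stop_words(letter_tokens(text)))
```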
{ "patient": { "first_name": "John", "last_name": "Doe", "dob": 252507600000, "gender": "Male", "race": "White", "height": 1.8288, "weight": 90.7185, "eyes": "blue", "hair": "brown", "age": 39, "tobacco": "no", "location": { "lat": 40.762446, "lon": -73.831653 }, "conditions": [{ "icd10": "M54.5", "description": "Low back pain" }, { "icd10": "Z91.018", "description": "Allergy to other foods" }], "medications": [{ "name": "Aspirin", "dosage": 150, "units": "mg", "frequency": 8, "freq_units": "hours" }] } }
○ (PUT patients/patient/1) ○ Last item is unique key (1)
if it has not been created before
posted
number
concurrency control when the version parameter is specified
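A minimal sketch of the indexing call, using only the Python standard library. It builds the PUT request without sending it, so it can be inspected without a running cluster. The base URL assumes the local node from the installation slide, build_index_request is a hypothetical helper, and the optional version query parameter reflects the 5.x-era behavior mentioned above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local single-node install


def build_index_request(index, doc_type, doc_id, document, version=None):
    """Build (but do not send) a PUT request that indexes one document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    if version is not None:
        # Ask the server to reject the write if the stored version differs.
        url += f"?version={version}"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


document = {"first_name": "John", "last_name": "Doe", "age": 39}
request = build_index_request("patients", "patient", 1, document)
print(request.get_method(), request.full_url)

# With a node running, the request could be sent with:
#   urllib.request.urlopen(request)
```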
○ Leaf query clauses look for a particular value in a particular field, such as the match, term, or range queries.
○ Compound query clauses wrap other leaf or compound queries and are used to combine multiple queries in a logical fashion (such as the bool or dis_max query), or to alter their behavior (such as the constant_score query).
○ Match All ○ Query String
○ Term ○ Range ○ Exists ○ Regexp ○ Fuzzy
○ Bool ○ Boosting
○ Nested
○ Geo Shape ○ Geo Distance ○ Geo Polygon
○ More Like This ○ Template ○ Script
{ "query": { "bool": { "must": [{ "match": { "medications.name": "Aspirin" } }], "filter": [{ "term": { "gender": "Male" } }, { "range": { "age": { "lte": 50, "gte": 30 } } }] } } }
Notation)
against a specific index ○ GET /patients/_search
between the ages of 30 and 50 who use aspirin
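The search for male patients between 30 and 50 who use aspirin can also be assembled programmatically. The sketch below builds the same bool query body as a Python dictionary; build_patient_query is a hypothetical helper name, and the body would be sent as the JSON payload of the _search request.

```python
import json


def build_patient_query(medication, gender, min_age, max_age):
    """Build a bool query: scored match on medication, unscored filters on the rest."""
    return {
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"medications.name": medication}}],
                # "filter" clauses only include or exclude documents.
                "filter": [
                    {"term": {"gender": gender}},
                    {"range": {"age": {"gte": min_age, "lte": max_age}}},
                ],
            }
        }
    }


query = build_patient_query("Aspirin", "Male", 30, 50)
print(json.dumps(query, indent=2))
```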
{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.3862944, "hits": [{ "_index": "patients", "_type": "patient", "_id": "1", "_score": 1.3862944, "_source": { "first_name": "John", "last_name": "Doe", "dob": 252507600000, . . . } }] } }
Query returns a formatted JSON result indicating the search metrics
○ Length of time in milliseconds the query took to execute and return
○ Number of shards utilized in execution
○ Total and max score of all results ○ Hits[] is an array of resulting documents, which can be limited by size
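A short Python sketch of how a client might pull these metrics out of a search response. The sample dictionary mirrors the response shown earlier, limited to the _source fields that are visible there, and it assumes the 5.x response shape in which hits.total is a plain number. summarize_search_response is a hypothetical helper.

```python
def summarize_search_response(response):
    """Extract the timing, shard, and hit metrics from a search response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "shards_total": response["_shards"]["total"],
        "shards_failed": response["_shards"]["failed"],
        "total_hits": hits["total"],
        "max_score": hits["max_score"],
        "documents": [hit["_source"] for hit in hits["hits"]],
    }


# Sample response; the remaining _source fields are elided in the original.
sample = {
    "took": 1,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 1,
        "max_score": 1.3862944,
        "hits": [{
            "_index": "patients",
            "_type": "patient",
            "_id": "1",
            "_score": 1.3862944,
            "_source": {"first_name": "John", "last_name": "Doe", "dob": 252507600000},
        }],
    },
}

summary = summarize_search_response(sample)
print(summary["took_ms"], summary["total_hits"], summary["documents"][0]["last_name"])
```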
{ "query": { "bool": { "must": [{ "match": { "gender": "Male" } }] } }, "aggs": { "medications": { "terms": { "field": "medications.name" } } } }
An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents.
Bucketing: A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion.
Metric: Aggregations that keep track and compute metrics over a set of documents.
Matrix: A family of aggregations that operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields.
Pipeline: Aggregations that aggregate the output of other aggregations and their associated metrics.
{ . . . "aggregations" : { "medications" : { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets" : [ { "key" : "Aspirin", "doc_count" : 2465 }, { "key" : "Omeprazole", "doc_count" : 1824 }, { "key" : "Lisinopril", "doc_count" : 1121 } ] } } }
A bucket aggregation finds all documents matching the query (in this case all males) and aggregates the results into key and doc_count fields. Only documents matching the initial query will be considered for aggregation.
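To make the bucketing behavior concrete, here is a small in-memory Python sketch of what a terms aggregation computes: filter the documents with the query, then count how many matching documents carry each key. It runs on a handful of hypothetical patient records and is not how Elasticsearch executes aggregations across shards.

```python
from collections import Counter


def terms_aggregation(docs, matches_query, field_values):
    """Count matching documents per key, highest doc_count first."""
    counts = Counter()
    for doc in docs:
        if not matches_query(doc):
            continue  # only documents matching the query are aggregated
        # A document counts once per key, even if the key repeats inside it.
        for value in set(field_values(doc)):
            counts[value] += 1
    # Sort by count descending; ties are broken by key to keep output stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


# Hypothetical records, not taken from the slides.
patients = [
    {"gender": "Male", "medications": [{"name": "Aspirin"}, {"name": "Omeprazole"}]},
    {"gender": "Male", "medications": [{"name": "Aspirin"}]},
    {"gender": "Female", "medications": [{"name": "Aspirin"}, {"name": "Lisinopril"}]},
    {"gender": "Male", "medications": [{"name": "Lisinopril"}]},
]

buckets = terms_aggregation(
    patients,
    matches_query=lambda doc: doc["gender"] == "Male",
    field_values=lambda doc: [m["name"] for m in doc["medications"]],
)
print(buckets)
```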
{ "query": { "bool": { "must": [{ "match": { "gender": "Male" } }] } }, "aggs": { "age_stats": { "extended_stats": { "field": "age" } } } }
The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.
{ "aggregations": { "age_stats": { "count": 10000, "min": 1, "max": 104, "avg": 64, "sum": 156478, "sum_of_squares": 467028, "variance": 51.55555555555556, "std_deviation": 7.180219742846005, "std_deviation_bounds": { "upper": 78.36043948569201, "lower": 49.63956051430799 } } } }
By default, the extended_stats metric will return an object called std_deviation_bounds, which provides an interval of plus/minus two standard deviations from the mean. This can be a useful way to visualize the variance of your data. If you want a different boundary, for example three standard deviations, you can set sigma in the request, e.g. "sigma": 3. Your data must be normally distributed for the metrics to make sense. The statistics behind standard deviations assume normally distributed data, so if your data is skewed heavily left or right, the value returned will be misleading.
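The relationship between the mean, the standard deviation, and the sigma setting can be sketched in plain Python. This is an illustrative calculation using the population variance over a list of hypothetical ages; it is not a call to Elasticsearch, and the function name simply borrows the metric's name.

```python
import math


def extended_stats(values, sigma=2.0):
    """Compute summary statistics and bounds of avg +/- sigma standard deviations."""
    if not values:
        raise ValueError("values must not be empty")
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    # Population variance; clamp tiny negative results caused by rounding.
    variance = max(0.0, sum_of_squares / count - avg * avg)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


ages = [30, 40, 50, 60]  # hypothetical sample
default_bounds = extended_stats(ages)["std_deviation_bounds"]
wider_bounds = extended_stats(ages, sigma=3)["std_deviation_bounds"]
print(default_bounds)
print(wider_bounds)
```

Raising sigma from 2 to 3 only widens the interval around the same mean; the other statistics are unchanged.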
eBay
Searching across 800 million listings in subseconds
USAA
Securing USAA's entire internal network and application portfolio
Thomson Reuters
Driving better research, analysis, and journalism
Stack Exchange
Powering their search experience across anything from core to careers code.
Geographic Regions
Historical Trends
Real-Time Search
Map/Reduce