CHAIR PROF. BÖHM
Compression and Similarity Indexing for Time Series
Master’s Thesis Marco Neumann | 19th of August 2016
KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association
www.kit.edu
1 Google n-gram data
2 Clean-up
3 Similarity
4 Baseline
5 CASINO TIMES
6 Final Words
Google n-gram data Clean-up Similarity Baseline CASINO TIMES Final Words Marco Neumann – CASINO TIMES 19th of August 2016 2/34
slow
confirmation bias
choosing possible candidates is subject to frame
1 for interactive analysis
design & evaluation of baseline
design & evaluation of an own approach
1 string filtering:
numbers word classes
2 string normalization:
NFKC Unicode normalization lowercase
3 word normalization:
stemming lemmatisation
4 pruning:
rare words OCR errors
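The clean-up steps listed above could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the regular expression for number tokens, and the `min_count` cut-off are hypothetical. Word-class filtering, stemming/lemmatisation and OCR-error detection are left out because they need external resources. Only the Python standard library is used.

```python
import re
import unicodedata
from collections import Counter

NUMBER = re.compile(r"^[\d.,]+$")  # assumed definition of a "number" token


def normalize(token):
    """String normalization: NFKC Unicode normalization, then lowercase."""
    return unicodedata.normalize("NFKC", token).lower()


def clean(counts, min_count=2):
    """Filter numbers, normalize strings, merge duplicates, prune rare words.

    counts maps a raw token to its total frequency; min_count is a
    hypothetical pruning threshold.
    """
    merged = Counter()
    for token, n in counts.items():
        if NUMBER.match(token):        # 1. string filtering (numbers only)
            continue
        merged[normalize(token)] += n  # 2. string normalization
    # 4. pruning of rare words (step 3, word normalization, is omitted here)
    return {tok: n for tok, n in merged.items() if n >= min_count}


# "Time" and "time" collapse into one entry; the ligature in "ﬁsh" is
# expanded by NFKC, but the word is then pruned because it is too rare.
cleaned = clean({"Time": 3, "time": 2, "ﬁsh": 1, "1984": 9, "rare": 1})
```

Merging counts after normalization matters here: pruning before merging would drop spellings that are rare individually but frequent in total.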
VLDB, 2002, Exact Indexing of Dynamic Time Warping; copying is by permission of the Very Large Data Base Endowment.
1 log(x + 1)
2 Gauss-smoothing
3 gradient calculation
4 DTW with warping
pre-calculation
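A minimal sketch of this similarity pipeline, using only the Python standard library. Several details are assumptions and not taken from the slides: the smoothing width `sigma`, the kernel truncation at three standard deviations, squared point differences, and the reading of the warping restriction as a Sakoe-Chiba band of width `window`. All function names are hypothetical.

```python
import math


def gauss_smooth(values, sigma=2.0):
    """Smooth with a truncated, normalized Gaussian kernel (edges repeated)."""
    radius = int(3 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(kernel)
    last = len(values) - 1
    return [
        sum(w * values[min(max(i + k, 0), last)] for k, w in zip(offsets, kernel)) / total
        for i in range(len(values))
    ]


def gradient(values):
    """Central differences inside, one-sided at both ends (needs >= 2 values)."""
    last = len(values) - 1
    return [
        (values[min(i + 1, last)] - values[max(i - 1, 0)])
        / (min(i + 1, last) - max(i - 1, 0))
        for i in range(len(values))
    ]


def preprocess(series, sigma=2.0):
    """Pre-calculation: logarithm, Gaussian smoothing, gradient."""
    logged = [math.log1p(v) for v in series]
    return gradient(gauss_smooth(logged, sigma))


def dtw(a, b, window):
    """DTW distance with warping limited to |i - j| <= window."""
    n, m = len(a), len(b)
    window = max(window, abs(n - m))  # the band must reach the last cell
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def bump(centre):
    """Toy frequency curve: one peak around the given time step."""
    return [100 * math.exp(-0.5 * ((t - centre) / 5) ** 2) for t in range(50)]


a, b = preprocess(bump(20)), preprocess(bump(23))
print(dtw(a, b, window=0), dtw(a, b, window=5))
```

With `window=0` the function reduces to the Euclidean distance. A wider band can only lower the result, so the second printed value is never larger than the first. The first three steps depend only on one series, which is why they can be computed once in advance.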
VLDB, 2002, Exact Indexing of Dynamic Time Warping; copying is by permission of the Very Large Data Base Endowment.
speed up nn queries using an index
compress data
enable subrange queries w/o re-indexing
slow pre-processing, fast search
use normal hardware
same children
difference of coefficients is small (= compression error is below threshold)
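The two merge conditions can be written as a single predicate. The sketch below is hypothetical: the `Node` layout (one coefficient plus a tuple of child ids) and the name `can_merge` are assumptions made for illustration, not the data structure used in the thesis.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    coefficient: float
    children: Tuple[int, ...]  # ids of the child nodes; () for a leaf


def can_merge(a, b, threshold):
    """Both conditions: identical children and a small coefficient gap."""
    same_children = a.children == b.children
    small_difference = abs(a.coefficient - b.coefficient) <= threshold
    return same_children and small_difference


left = Node(1.00, (3, 4))
right = Node(1.05, (3, 4))
other = Node(1.05, (3, 5))
print(can_merge(left, right, 0.1), can_merge(left, other, 0.1))
```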
definition of similarity
fast baseline algorithm
2 starting collaboration with Prof. Dr. Sanders
IEEE-half floating point
non-IEEE data types (e.g. A-law and μ-law)
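To give a feeling for the two families of reduced-precision types, here is a small standard-library sketch. It round-trips a value through IEEE 754 half precision, and it quantizes a value to 8 bits with the continuous mu-law companding curve. The function names are hypothetical. The mu-law code uses the textbook formula with the G.711 parameter of 255, not the piecewise-linear bit layout of the G.711 codec, and the input range [-1, 1] is an assumption.

```python
import math
import struct

MU = 255.0  # companding parameter used by G.711 mu-law


def to_half(x):
    """Round-trip a value through IEEE 754 binary16 (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mu_law_encode(x, bits=8):
    """Compand x in [-1, 1] with the mu-law curve, then quantize to an int."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    levels = 2 ** (bits - 1) - 1
    return round(y * levels)


def mu_law_decode(code, bits=8):
    """Invert mu_law_encode up to the quantization error."""
    levels = 2 ** (bits - 1) - 1
    y = code / levels
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


value = 0.01
print(to_half(value), mu_law_decode(mu_law_encode(value)))
```

The companding curve spends more of its code points on small magnitudes, which is the reason such types could suit coefficients that cluster near zero.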
https://www.flickr.com/photos/grongar/8704148177/
Algorithm 1: runCompression
Data: Trees T
Data: Error threshold ε
begin
    for t ∈ shuffled(T) do
        e ← 0;
        while hasUntriedPossibleMerge(t) do
            m ← pickCheapestMerge(t);
            if e + maxErrIncrease(m) ≤ ε then
                executeMerge(m);
                e ← e + realErrIncrease(m);
            end
            markTried(m);
        end
        addToDB(t);
    end
end
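The control flow of Algorithm 1 can be mirrored in a short runnable sketch. Only the loop structure follows the pseudocode. The `ops` object bundling the helper functions, the list standing in for the database, and the `ToyOps` class (where a tree is just a dict of pending merge costs and the real error is assumed to be half of its upper bound) are hypothetical stand-ins.

```python
import random


def run_compression(trees, threshold, ops, db, rng=random):
    """Greedily merge per tree while the accumulated error stays <= threshold."""
    order = list(trees)
    rng.shuffle(order)
    for tree in order:
        error = 0.0
        while ops.has_untried_possible_merge(tree):
            merge = ops.pick_cheapest_merge(tree)
            # Test against the upper bound, then book the real increase.
            if error + ops.max_err_increase(merge) <= threshold:
                ops.execute_merge(merge)
                error += ops.real_err_increase(merge)
            ops.mark_tried(merge)
        db.append(tree)


class ToyOps:
    """Toy stand-in: a merge is a (tree, cost) pair."""

    def has_untried_possible_merge(self, tree):
        return bool(tree["pending"])

    def pick_cheapest_merge(self, tree):
        return (tree, min(tree["pending"]))

    def max_err_increase(self, merge):
        return merge[1]          # upper bound on the error increase

    def real_err_increase(self, merge):
        return merge[1] / 2      # toy assumption: half of the bound

    def execute_merge(self, merge):
        merge[0]["merged"].append(merge[1])

    def mark_tried(self, merge):
        merge[0]["pending"].remove(merge[1])


tree = {"pending": [0.4, 0.1, 0.3], "merged": []}
db = []
run_compression([tree], threshold=0.5, ops=ToyOps(), db=db)
print(tree["merged"])
```

In this toy run the merges with cost 0.1 and 0.3 are executed, while the one with cost 0.4 is rejected because the accumulated error plus its bound would exceed the threshold. Every candidate is marked as tried whether or not it was executed, which is what guarantees that the inner loop terminates.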