Data Acquisition Axel Ngonga Lead Data Acquisition BIG Data PPF http://big-project.eu
Motivation
● Increasing amount of data
  ○ 4K new pictures on Instagram
  ○ 100K tweets
  ○ 800K new pieces of content on Facebook
  ○ …
Motivation
● Big data technologies for
  ○ Improved business intelligence
  ○ Secure decisions
  ○ Customized services
  ○ …
● Use Cases
  ○ Mission planning
  ○ Trade market
  ○ Customized services
  ○ Criminality prediction
  ○ …
Definition
● Data acquisition stands for
  ○ Selection of data sources
  ○ Collection of information from these sources
  ○ Filtering and cleaning of the collected data
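The three steps of the definition can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `SOURCES` dictionary, the record layout, and all function names are hypothetical and do not come from the slides.

```python
import json

# Hypothetical in-memory "sources"; real ones would be feeds, logs, or APIs.
SOURCES = {
    "sensor_a": ['{"id": 1, "value": " 21.5 "}', '{"id": 2, "value": ""}'],
    "sensor_b": ['{"id": 3, "value": "19.0"}', "not json"],
    "archive": ['{"id": 4, "value": "0.0"}'],
}


def select_sources(names):
    """Step 1: keep only the sources that were asked for."""
    return {name: SOURCES[name] for name in names if name in SOURCES}


def collect(sources):
    """Step 2: lazily gather raw records from every selected source."""
    for name, records in sources.items():
        for raw in records:
            yield name, raw


def filter_and_clean(raw_records):
    """Step 3: drop unparseable or empty records, normalise the rest."""
    for name, raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # filtering: skip records that are not valid JSON
        value = str(record.get("value", "")).strip()
        if not value:
            continue  # filtering: skip records without a value
        # cleaning: trimmed whitespace, value converted to a number
        yield {"source": name, "id": record["id"], "value": float(value)}


def acquire(names):
    """Run selection, collection, and filtering/cleaning in order."""
    return list(filter_and_clean(collect(select_sources(names))))


if __name__ == "__main__":
    for row in acquire(["sensor_a", "sensor_b"]):
        print(row)
```

Generators keep the sketch lazy, so records flow through the steps one at a time instead of being held in memory all at once.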
Overview
[Diagram; labels recovered: DS (four nodes), Processing, Storage, (cleaning, classification)]
More than 3 Vs
● The 9(?) Vs of Big Data Acquisition
  ○ Volume
  ○ Velocity
  ○ Variety
  ○ Vocabulary
  ○ Variability (security models, ownership)
  ○ Veracity (trustworthiness of data)
  ○ Visibility (integrated view of data)
  ○ Value (worth of data for data consumer)
  ○ Visualization
Requirements
● Extensibility of protocols
● High scalability of approaches
● Low memory consumption
● Parallelism
● Elasticity
● Fast ROI
● High throughput (real-time)
Technology Overview
● Gathering
  ○ Advanced Message Queuing Protocol (AMQP)
    ■ Wire-level protocol
    ■ OASIS Standard since Oct. 2012
    ■ Large number of implementations incl. RabbitMQ, SwiftMQ, Apache ActiveMQ, Windows Azure Service Bus
  ○ JMS 2.0
  ○ Kestrel (Memcached)
  ○ Apache Kafka
  ○ Apache Flume (log data)
  ○ Facebook Scribe (log data)
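The gathering tools above share a message-queuing pattern: producers publish messages to a queue and consumers read from it at their own pace. The sketch below illustrates that pattern with Python's standard `queue` and `threading` modules only. It is not AMQP and does not use any of the listed products; the function names and the `SENTINEL` end marker are assumptions made for the example.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker for this sketch


def producer(q, messages):
    """Publish messages; put() blocks when the queue is full (back-pressure)."""
    for message in messages:
        q.put(message)
    q.put(SENTINEL)


def consumer(q, sink):
    """Read messages until the end marker arrives and hand them to a sink."""
    while True:
        message = q.get()
        if message is SENTINEL:
            break
        sink.append(message.upper())  # placeholder for real handling


def run(messages, capacity=2):
    """Connect one producer and one consumer through a bounded queue."""
    q = queue.Queue(maxsize=capacity)
    sink = []
    threads = [
        threading.Thread(target=producer, args=(q, messages)),
        threading.Thread(target=consumer, args=(q, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink


if __name__ == "__main__":
    print(run(["tweet", "log line", "picture metadata"]))
```

The bounded queue decouples the two sides: a slow consumer slows the producer down instead of letting memory grow without limit.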
Technology Overview
● Processing
  ○ Facebook Scribe (Aggregation)
  ○ Twitter Storm (Stream Data Processing, Analysis)
  ○ MOA (Massive Online Analysis, esp. classification)
  ○ Hadoop (Distributed Processing)
  ○ InfoSphere Streams (Analysis)
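Stream processing, as opposed to batch processing, handles each event as it arrives and emits results incrementally. A minimal sketch of one common stream operation, counting events per fixed (tumbling) time window, is given below. It does not use Storm, MOA, or any other listed tool; the function name and the `(timestamp, key)` event layout are assumptions of the example, and events are assumed to arrive ordered by timestamp.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, Counter) each time a window closes.

    events: iterable of (timestamp, key), assumed ordered by timestamp.
    Windows that contain no events are not emitted.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = start
        if start != current_start:
            # The event belongs to a later window: emit the finished one.
            yield current_start, counts
            current_start, counts = start, Counter()
        counts[key] += 1
    if current_start is not None:
        yield current_start, counts  # flush the last open window


if __name__ == "__main__":
    stream = [(0, "a"), (3, "b"), (9, "a"), (10, "a"), (25, "b")]
    for window_start, counts in tumbling_window_counts(stream, 10):
        print(window_start, dict(counts))
```

Only the counts of the current window are held in memory, which matches the low-memory requirement for stream data.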
Technology Overview
● Storage
  ○ MongoDB (BSON)
  ○ Apache CouchDB (JSON)
  ○ Neo4j (Graph DB)
  ○ Oracle NoSQL
  ○ IBM DB2 NoSQL
● Holistic Frameworks
  ○ Oracle's Big Data Suite
  ○ IBM's Big Data Suite
  ○ Karmasphere
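Document stores such as those listed keep each record as a self-describing JSON-like document rather than as a row in a fixed schema. The sketch below illustrates the idea with a hypothetical in-memory class, `TinyDocumentStore`. It is a stand-in for illustration only and does not reproduce the API of MongoDB, CouchDB, or any other product.

```python
import json


class TinyDocumentStore:
    """Hypothetical in-memory stand-in for a document database."""

    def __init__(self):
        self._docs = {}  # maps document id -> serialised JSON text

    def put(self, doc):
        """Store a document under its "_id" field, serialised as JSON."""
        self._docs[doc["_id"]] = json.dumps(doc, sort_keys=True)

    def get(self, doc_id):
        """Return the document stored under doc_id."""
        return json.loads(self._docs[doc_id])

    def find(self, **criteria):
        """Return all documents whose fields equal the given criteria."""
        matches = []
        for raw in self._docs.values():
            doc = json.loads(raw)
            if all(doc.get(k) == v for k, v in criteria.items()):
                matches.append(doc)
        return matches


if __name__ == "__main__":
    store = TinyDocumentStore()
    # Documents need not share the same fields.
    store.put({"_id": "1", "source": "twitter", "text": "hello"})
    store.put({"_id": "2", "source": "instagram", "tags": ["sunset"]})
    print(store.find(source="twitter"))
```

Because documents carry their own structure, records from heterogeneous sources (the Variety dimension) can be stored side by side without a schema migration.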
Tool Matrix
Simple Recipe
1. Which of the 9 Vs are important for me?
2. What are my sources?
  ○ Protocols
  ○ Velocity
  ○ Type of data (logs, XML, …)
  ○ …
3. What's my current storage architecture?
  ○ NoSQL?
  ○ Distributed?
Thank You! Questions? Axel Ngonga University of Leipzig AKSW Research Group ngonga@informatik.uni-leipzig.de http://aksw.org/AxelNgonga http://big-project.eu