SLIDE 1

fortiss GmbH An-Institut Technische Universität München München, 2015-11-06

Modeling Big Data Systems by Extending the Palladio Component Model

6th Symposium on Software Performance (SSP) 2015

Johannes Kroß¹, Andreas Brunnert¹, Helmut Krcmar²

¹ fortiss GmbH, ² Technische Universität München

SLIDE 2

pmw.fortiss.org München, 2015-11-06 2

Agenda

Motivation
Development Process and Characteristics of Big Data Systems
Palladio Component Model (PCM) Meta-model Extension
Related Work
Conclusion and Future Work

SLIDE 4

Motivation

Various big data technologies with different characteristics
Casado and Younas (2015) list two main techniques that are common for big data systems, namely, batch and stream processing

Example technologies: Teradata Aster, EMC Greenplum, IBM Netezza, HP Vertica, SAP HANA, Apache Kafka, Cassandra, MongoDB, ElephantDB, Voldemort, Apache Flume, Amazon Kinesis, Cloudera, Hortonworks, VoltDB, Autonomy, Splunk, Tableau, TIBCO, Pentaho, S4, Apache Hadoop, Apache Spark, Apache Storm, Apache HBase, Apache Samza, MapR

SLIDE 5

Motivation

The added value of big data systems for organizations depends on the performance of such systems (Barbierato et al. 2014)
Performance models allow for proactive evaluations of these systems
Existing performance meta-models for big data systems, however, focus on either ...
… one processing paradigm such as stream processing, e.g., Ginis and Strom (2013)
… or one technology such as Apache Hadoop MapReduce, e.g., Ge et al. (2013)
We propose a general performance meta-model to specify shared characteristics of big data systems

SLIDE 7

Development Process of Big Data Systems

Component developers

Batch processing (e.g., using Apache Hadoop MapReduce)

    // Map step: emit (word, 1) for every token of the input value
    public void map(Object key, Text value, ..)..{
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

    // Reduce step: sum all counts emitted for one word
    public void reduce(Text key, Iterable<IntWritable> values,..)..{
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }

Stream processing (e.g., using Apache Storm)

    // Bolt: update the running count of each incoming word and emit it
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null)
            count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

SLIDE 8

Development Process of Big Data Systems

System deployers

Resource environment (e.g., Apache YARN)

[Figure: YARN deployment showing a Client, a Resource Manager, and Nodes whose Node Managers host Containers running the Application Master, Map Tasks, and a Reduce Task]

SLIDE 9

Characteristics of Big Data Systems

Based on the findings of previous work (Kroß et al. 2015), we derive the following requirements of big data systems, which we propose to implement

1. Distribution and parallelization of operations
Component developers specify reusable software components consisting of operations using software frameworks like Apache Spark. In doing so, they may specify the number of simultaneous and/or total executions of an operation, but they may also not know the definite number.

2. Clustering of resource containers
System deployers specify resource containers with resource roles (e.g., master or worker nodes), link them to a shared network, and logically group them into a computer cluster.
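
The distinction between total and simultaneous executions in the first requirement can be illustrated with a small Java sketch. It is not taken from the presentation: the class BoundedParallelRun, its Result type, and the run method are hypothetical names, and a fixed thread pool merely stands in for whatever a framework such as Apache Spark does internally. The sketch runs an operation a given total number of times while never having more than a given number of executions in flight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration; not part of PCM or of any big data framework. */
final class BoundedParallelRun {

    /** Outcome of a run: completed executions and the highest parallelism observed. */
    static final class Result {
        final int executed;
        final int peakParallelism;

        Result(int executed, int peakParallelism) {
            this.executed = executed;
            this.peakParallelism = peakParallelism;
        }
    }

    /** Runs the operation `total` times with at most `simultaneous` executions in flight. */
    static Result run(int total, int simultaneous, Runnable operation) {
        // Assumption: a fixed-size pool models the limit on simultaneous executions.
        ExecutorService pool = Executors.newFixedThreadPool(simultaneous);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger executed = new AtomicInteger();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < total; i++) {
                futures.add(pool.submit(() -> {
                    // Track how many executions overlap at this moment.
                    peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                    try {
                        operation.run();
                        executed.incrementAndGet();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for completion and surface failures
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("an execution failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return new Result(executed.get(), peak.get());
    }

    public static void main(String[] args) {
        Result r = run(10, 3, () -> { /* stand-in for one task, e.g. a map task */ });
        System.out.println(r.executed + " executions, peak parallelism " + r.peakParallelism);
    }
}
```

With a total of 10 and a limit of 3, all ten executions complete, and the observed peak parallelism never exceeds 3. When the definite numbers are unknown at design time, both values would have to be supplied later, for example by the deployment.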

SLIDE 11

PCM Meta-model Extension

Service effect specification (SEFF) actions

[Class diagram]

ExternalCallAction
    retryCount : Integer

DistributedCallAction
    totalForkCount : Integer
    simultaneousForkCount : Integer

Other classes shown: AbstractAction, CallReturnAction, CallAction, VariableUsage, SetVariableAction, AbstractInternalControlFlowAction, InterCallAction, OperationSignature, OperationRequiredRole

PCM Version 3.4.1
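
As a rough illustration of the attributes named on this slide, the following plain-Java sketch mirrors ExternalCallAction and DistributedCallAction. It is not generated PCM or EMF code. Two points are assumptions that the slide text does not confirm: that DistributedCallAction specializes ExternalCallAction, and the hypothetical helper rounds(), which only shows how the two fork counts relate if every fork takes equally long.

```java
/** Plain-Java stand-ins for the meta-classes named on the slide; not PCM/EMF code. */
abstract class AbstractAction {
}

class ExternalCallAction extends AbstractAction {
    final int retryCount;

    ExternalCallAction(int retryCount) {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        this.retryCount = retryCount;
    }
}

/** Assumption: modeled here as a specialization of ExternalCallAction. */
class DistributedCallAction extends ExternalCallAction {
    final int totalForkCount;        // total number of executions of the called operation
    final int simultaneousForkCount; // upper bound on executions running at the same time

    DistributedCallAction(int retryCount, int totalForkCount, int simultaneousForkCount) {
        super(retryCount);
        if (totalForkCount < 1 || simultaneousForkCount < 1) {
            throw new IllegalArgumentException("fork counts must be >= 1");
        }
        this.totalForkCount = totalForkCount;
        this.simultaneousForkCount = simultaneousForkCount;
    }

    /**
     * Hypothetical helper: number of sequential rounds needed when every fork
     * takes equally long, i.e. ceil(totalForkCount / simultaneousForkCount).
     */
    int rounds() {
        return (totalForkCount + simultaneousForkCount - 1) / simultaneousForkCount;
    }

    public static void main(String[] args) {
        DistributedCallAction call = new DistributedCallAction(0, 10, 3);
        System.out.println("rounds: " + call.rounds());
    }
}
```

For example, 10 total forks with at most 3 simultaneous forks need 4 rounds under the equal-duration assumption; a simulator would replace this arithmetic with actual scheduling.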

SLIDE 12

PCM Meta-model Extension

Resource environment

[Class diagram]

<<Enumeration>> SchedulingPolicy
    DELAY
    FCFS
    PROCESSOR_SHARING
    ROUND_ROBIN

<<Enumeration>> ResourceRole
    CLUSTER
    WORKER
    MASTER

ClusterResourceSpecification
    resourceRole : ResourceRole
    actionSchedulingPolicy : SchedulingPolicy

Other classes shown: ResourceContainer, ProcessingResourceSpecification, ResourceEnvironment, LinkingResource, CommunicationLinkResourceSpecification

PCM Version 3.4.1
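
The enumerations and the ClusterResourceSpecification above can be pictured with the following plain-Java sketch of a cluster made of one master and several workers. It is a simplification, not PCM code: the ResourceContainer class here is a hypothetical stand-in, grouping master and worker containers as children of a container with the CLUSTER role is an assumption, and so is applying the same scheduling policy to every container. The helper names buildCluster and countByRole are likewise invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

enum SchedulingPolicy { DELAY, FCFS, PROCESSOR_SHARING, ROUND_ROBIN }

enum ResourceRole { CLUSTER, WORKER, MASTER }

final class ClusterResourceSpecification {
    final ResourceRole resourceRole;
    final SchedulingPolicy actionSchedulingPolicy;

    ClusterResourceSpecification(ResourceRole resourceRole, SchedulingPolicy actionSchedulingPolicy) {
        this.resourceRole = resourceRole;
        this.actionSchedulingPolicy = actionSchedulingPolicy;
    }
}

/** Hypothetical, simplified stand-in for a PCM ResourceContainer. */
final class ResourceContainer {
    final String name;
    final ClusterResourceSpecification clusterSpec;
    private final List<ResourceContainer> nested = new ArrayList<>();

    ResourceContainer(String name, ClusterResourceSpecification clusterSpec) {
        this.name = name;
        this.clusterSpec = clusterSpec;
    }

    void add(ResourceContainer child) {
        nested.add(child);
    }

    List<ResourceContainer> nested() {
        return Collections.unmodifiableList(nested);
    }
}

final class ClusterSketch {

    /** Assumption: a CLUSTER container logically groups one MASTER and `workers` WORKER containers. */
    static ResourceContainer buildCluster(String name, int workers, SchedulingPolicy policy) {
        ResourceContainer cluster = new ResourceContainer(
                name, new ClusterResourceSpecification(ResourceRole.CLUSTER, policy));
        cluster.add(new ResourceContainer(
                name + "-master", new ClusterResourceSpecification(ResourceRole.MASTER, policy)));
        for (int i = 1; i <= workers; i++) {
            cluster.add(new ResourceContainer(
                    name + "-worker-" + i, new ClusterResourceSpecification(ResourceRole.WORKER, policy)));
        }
        return cluster;
    }

    /** Counts the directly nested containers that carry the given role. */
    static long countByRole(ResourceContainer cluster, ResourceRole role) {
        return cluster.nested().stream()
                .filter(c -> c.clusterSpec.resourceRole == role)
                .count();
    }

    public static void main(String[] args) {
        ResourceContainer cluster = buildCluster("yarn", 3, SchedulingPolicy.ROUND_ROBIN);
        System.out.println(countByRole(cluster, ResourceRole.WORKER) + " workers, "
                + countByRole(cluster, ResourceRole.MASTER) + " master");
    }
}
```

Scaling the cluster up or down then amounts to changing the worker count passed to buildCluster, which matches the up- and downscaling scenarios a system deployer would want to explore.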

SLIDE 13

PCM Meta-model Extension

Service effect specification (SEFF) diagram

SLIDE 14

PCM Meta-model Extension

Resource environment diagram

SLIDE 16

Related Work

Ginis and Strom (2013) present a method for predicting the response time of stream processes in distributed systems
Verma et al. (2011) introduce the ARIA framework, which focuses on the scheduling strategy of single Apache Hadoop MapReduce jobs
Vianna et al. (2013) propose an analytical performance model that focuses on the pipeline between map and reduce jobs
Barbierato et al. (2014) and Ge et al. (2013) present modeling techniques for Apache Hadoop MapReduce that allow estimating response times only
Castiglione et al. (2014) use Markovian agents and mean field analysis to model big data batch applications and to provide information about the performance of cloud-based data processing architectures

SLIDE 18

Conclusion and Future Work

We introduced a modeling approach that captures essential characteristics of data processing as found in big data systems
We presented two meta-model extensions for PCM ...
… to model a computer cluster and
… to apply distributed and parallel operations on this cluster
We plan to ...
… complete extending the simulation framework SimuCom
… fully evaluate our extensions for up- and downscaling scenarios
… automatically derive performance models based on measurement data

SLIDE 19

References

Barbierato, E., Gribaudo, M., Iacono, M.: Performance evaluation of nosql big-data applications using multi-formalism models. Future Generation Computer Systems 37(0), 345-353 (2014)

Casado, R., Younas, M.: Emerging trends and technologies in big data processing. Concurrency and Computation: Practice and Experience 27(8), 2078-2091 (2015)

Castiglione, A., Gribaudo, M., Iacono, M., Palmieri, F.: Modeling performances of concurrent big data applications. Software: Practice and Experience (2014)

Ge, S., Zide, M., Huet, F., Magoules, F., Lei, Y., Xuelian, L.: A Hadoop MapReduce performance prediction method. In: Proceedings of the IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, pp. 820-825 (2013)

Ginis, R., Strom, R.E.: Method for predicting performance of distributed stream processing systems. US Patent 8,499,069, url: https://www.google.com/patents/US8499069 (2013)

Kroß, J., Brunnert, A., Prehofer, C., Runkler, T., Krcmar, H.: Stream processing on demand for lambda architectures. Computer Performance Engineering (Vol. 9272) Eds.: M. Beltrán, W. Knottenbelt, and J. Bradley, pp. 243-257. Springer International Publishing (2015)

Verma, A., Cherkasova, L., Campbell, R.H.: Aria: automatic resource inference and allocation for mapreduce environments. In: Proceedings of the 8th ACM International Conference on Autonomic Computing. pp. 235-244. ACM, New York, NY, USA (2011)

Vianna, E., Comarela, G., Pontes, T., Almeida, J., Almeida, V., Wilkinson, K., Kuno, H., Dayal, U.: Analytical performance models for mapreduce workloads. International Journal of Parallel Programming 41(4), 495-525 (2013)

SLIDE 20

Q&A

Johannes Kroß

kross@fortiss.org performancegroup@fortiss.org pmw.fortiss.org