Monitoring Systems and POWER5/6 LPARs with Ganglia
Michael Perzl, michael@perzl.org
Agenda
- Ganglia – what is it ?
- Ganglia components and data flow
- An introduction to RRDTool
- Ganglia metrics – what can be measured ?
- New POWER5/6 metrics (AIX & Linux)
- Extending Ganglia with gmetric
- Add device specific information to Ganglia
- Ganglia network communication
- Installation issues
- Where to get Ganglia for AIX and Linux on POWER ?
- Best practices
- Future additions / plans
- Discussion
- Links
Ganglia – what is it ?
Ganglia – what is it ? (1/3)
- Ganglia is an Open Source cluster performance monitoring tool and has been extended to include POWER5/6 features like shared processor LPARs, entitlement, physical CPU usage etc.
- This session covers:
– the technical details of Ganglia and the POWER5/6 extensions
– how to set it up and use it to monitor all LPARs in a single machine and lots of machines
Ganglia – what is it ? (2/3)
Ganglia properties:
- scalable distributed monitoring system for high-performance computing systems such as clusters and grids
- based on a hierarchical design targeted at federations of clusters
- relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state
- leverages widely used technologies such as
– XML for data representation
– XDR (eXternal Data Representation) for compact, portable data transport
– RRDtool for data storage and visualization
- uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency
- robust implementation
- Open Source, written in C
– Downloaded 110,000+ times, 145+ countries, 500+ clusters, 2000+ nodes
Ganglia – what is it ? (3/3)
Ganglia properties (cont.):
- has been ported to an extensive set of operating systems and processor architectures:
– AIX
– Darwin
– FreeBSD
– HP-UX
– IRIX
– Linux
– OSF
– NetBSD
– Solaris
– Windows (via Cygwin)
- is currently in use on 500+ clusters around the world
- has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000+ nodes
– check http://ganglia.info/ for more details
Ganglia components and data flow
Ganglia components
The Ganglia system consists of:
- two unique daemons:
– Ganglia Monitoring Daemon (gmond)
- monitoring daemon, collects the metrics
- runs on each node
– Ganglia Meta Daemon (gmetad)
- polls all gmond clients and stores the collected metrics in Round-Robin Databases (RRDs)
- a PHP-based web frontend
- a few other small utility programs
– gmetric
- can be used to easily extend Ganglia with additional user-defined metrics
– gstat
– gexec
Ganglia – Schematic View
From: “Ganglia: Past, Present and Future” by Matt Massie: URL: http://ganglia.info/talks/lug_lbl_talk/
Ganglia Architecture
Ganglia Monitoring Daemon (gmond)
- Ganglia Monitoring Daemon (gmond) is a multi-threaded daemon which runs on each cluster node you want to monitor.
- Installation is easy:
– just the daemon and a configuration file (/etc/gmond.conf)
- gmond has four main responsibilities:
1. monitor changes in host state
2. announce relevant changes
3. listen to the state of all other Ganglia nodes via a unicast or multicast channel
4. answer requests for an XML description of the cluster state
- Each gmond transmits information in two different ways:
– unicasting or multicasting host state in external data representation (XDR) format using UDP messages
– sending XML over a TCP connection
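The XML-over-TCP path can be tried out with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names are hypothetical, port 8649 is the tcp_accept_channel port used in the configuration examples of this deck, and the sample document is hand-written in the general shape of a gmond dump (GANGLIA_XML, CLUSTER, HOST and METRIC elements with NAME and VAL attributes) with made-up values.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump that gmond writes on its TCP accept channel.

    Returns raw bytes so the XML parser can honor the encoding declaration.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the peer closes the connection when the dump is complete
                break
            chunks.append(data)
    return b"".join(chunks)


def parse_cluster_state(xml_data):
    """Return {host_name: {metric_name: value_string}} from a gmond XML dump."""
    root = ET.fromstring(xml_data)
    state = {}
    for host in root.iter("HOST"):
        state[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL") for metric in host.iter("METRIC")
        }
    return state


# Hand-written sample (assumption: values and host address are invented).
SAMPLE = """<GANGLIA_XML VERSION="3.0.5" SOURCE="gmond">
<CLUSTER NAME="System p5 Model 550" OWNER="unspecified">
<HOST NAME="p550-aix" IP="10.0.0.1">
<METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
<METRIC NAME="cpu_num" VAL="4" TYPE="uint16" UNITS="CPUs"/>
</HOST>
</CLUSTER>
</GANGLIA_XML>"""

if __name__ == "__main__":
    print(parse_cluster_state(SAMPLE))
```

Against a running gmond, parse_cluster_state(fetch_gmond_xml("p550-aix")) would return the state of every host that gmond knows about, since each gmond holds the state of its whole cluster.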
Ganglia Meta Daemon (gmetad) (1/2)
- Ganglia Meta Daemon (gmetad) is a daemon which typically only runs on one specific cluster node – or on more when using a staged setup.
- Installation is easy:
– just the daemon and a configuration file (/etc/gmetad.conf)
- Federation in Ganglia is achieved using a tree of point-to-point connections amongst representative cluster nodes to aggregate the state of multiple clusters.
- At each node in the tree a gmetad
– periodically polls a collection of child data sources
– parses the collected XML
– saves all numeric volatile metrics to round-robin databases
– exports the aggregated XML over a TCP socket to clients
Ganglia Meta Daemon (gmetad) (2/2)
- Data sources may be either
– gmond daemons, representing specific clusters, or
– other gmetad daemons, representing sets of clusters
- Data sources use source IP addresses for access control
– Multiple IP addresses can be specified for failover
– This failover capability is natural for aggregating data from clusters since each gmond daemon contains the entire state of its cluster
Ganglia PHP web frontend (1/2)
Web frontend properties:
- provides a view of the gathered information via real-time dynamic web pages
- displays Ganglia data in a meaningful way for system administrators and users
– For example, one can view the CPU utilization over the past hour, day, week, month, or year
– The web frontend shows similar graphs for memory usage, disk usage, network statistics, number of running processes, and all other Ganglia metrics
Ganglia PHP web frontend (2/2)
Web frontend properties (cont.):
- depends on the existence of the gmetad which provides it with data from several Ganglia sources
- connects to gmetad on the local port 8651 (by default) and expects to receive a Ganglia XML tree
- the web pages themselves are highly dynamic; any change to the Ganglia data appears immediately on the site
– This behavior leads to a very responsive site, but requires that the full XML tree be parsed on every page access
– Therefore, the Ganglia web frontend should run on a fairly powerful, dedicated machine if it presents a large amount of data
- is written in the PHP scripting language and uses graphs generated by gmetad to display history information
- has been tested on many flavors of Unix (primarily Linux) with the Apache web server and the PHP 4.1 module
Ganglia - data flow
[Figure sequence (1/4 to 4/4, followed by a summary slide "Ganglia - data flow again"). Recoverable labels: gmond runs as one daemon per node/LPAR, is configured by /etc/gmond.conf and reads the operating system performance stats API. gmetad is configured by /etc/gmetad.conf, runs on the web server (only one instance) and maintains the rrdtool database of statistics. Apache2 + PHP5 with the Ganglia frontend (PHP) scripts serves the browser. gmetric is a user command. Legend: file access, network, web.]
An introduction to RRDTool
RRDTool
- Homepage: http://oss.oetiker.ch/rrdtool/
- RRD is the acronym for Round-Robin Database.
- RRD is a system to store and display time-series data (e.g., network bandwidth, machine-room temperature, server load average).
- It stores the data in a very compact way that will not expand over time (fixed size of DB), and it presents useful graphs by processing the data to enforce a certain data density.
- It can be used either via simple wrapper scripts (from shell or Perl) or via frontends that poll network devices and put a friendly user interface on it.
RRDTool is the industry standard tool to store and display time-series data!
RRDTool example graph
[Figure: example graph taken from http://oss.oetiker.ch/rrdtool/gallery/index.en.html]
The graph shows inbound and outbound call traffic going in and out of the switch via the 6 trunks connected to the Diamond exchange. Inbound traffic is shown as positive and uses a lowest-free fill method. Outbound traffic is shown as negative and uses a distributed fill method. Tech details on RRDtrac.
RRDTool example
# rrdtool create test.rrd \
    --start 920804400 \
    --step 300 \
    DS:km:COUNTER:600:U:U \
    RRA:AVERAGE:0.5:1:24
# rrdtool update test.rrd 920804700:12345 920805000:12357 920805300:12363
# rrdtool update test.rrd 920805600:12363 920805900:12363 920806200:12373
# rrdtool update test.rrd 920806500:12383 920806800:12393 920807100:12399
# rrdtool update test.rrd 920807400:12405 920807700:12411 920808000:12415
# rrdtool update test.rrd 920808300:12420 920808600:12422 920808900:12423
# rrdtool graph kilometer.png \
    --start 920804400 \
    --end 920808000 \
    DEF:mykm=test.rrd:km:AVERAGE \
    LINE2:mykm#FF0000
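What a COUNTER data source does with these updates can be sketched in a few lines of Python: consecutive readings are turned into a per-second rate. This is only an illustration of the arithmetic using the first few readings from the example; the function name is hypothetical, and the sketch leaves out what RRDtool additionally handles, such as counter wrap-around, the 600-second heartbeat and consolidation into the RRA.

```python
def counter_rates(samples):
    """Turn (timestamp, counter) pairs into (timestamp, per-second rate) pairs."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((t1, (v1 - v0) / (t1 - t0)))
    return rates


# First four readings from the rrdtool update commands in the example.
SAMPLES = [
    (920804700, 12345),
    (920805000, 12357),
    (920805300, 12363),
    (920805600, 12363),
]

if __name__ == "__main__":
    for timestamp, rate in counter_rates(SAMPLES):
        print(timestamp, rate)
```

The counter advances by 12 between the first two readings, 300 seconds apart, so the stored rate for that interval is 12 / 300 = 0.04 per second.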
Ganglia metrics – what can be monitored ?
Metrics
Definition of a metric:
- A metric is a certain observed property of the system.
Number of metrics:
- 34 standard metrics, i.e., defined for all platforms (not every platform implements all of them, see Remarks)
- Additional platform dependent metrics available
– Solaris
- 8 additional metrics available
– HP-UX
- 4 additional metrics available
– AIX
- 18 additional new metrics available for POWER5/6 !!!
- details later…
Remarks:
- One RRD database per Ganglia metric is used
- Database size is fixed (~ 12 kB per RRD database with default settings)
- Some standard metrics do not exist on all platforms, e.g., some metrics (coming from Linux) don’t exist or don’t make sense on AIX
Ganglia standard metrics (1/2)
- boottime
–system boot timestamp
- bytes_in
–number of network bytes received per second
- bytes_out
–number of network bytes sent out per second
- cpu_aidle
–percent of time since boot idle CPU
–not defined on AIX, defined on Linux
- cpu_idle
–percent CPU idle time
- cpu_nice
–percent CPU nice
–not defined on AIX, defined on Linux
- cpu_num
–number of CPUs
- cpu_intr
–number of interrupts (??)
–not defined on AIX, defined on Linux
- cpu_sintr
–number of system interrupts (??)
–not defined on AIX, defined on Linux
- cpu_speed
–speed of CPUs in MHz
- cpu_system
–percent CPU system
- cpu_user
–percent CPU user
- cpu_wio
–CPU time spent waiting for I/O
- disk_free
–total free disk space in GB
- disk_total
–total available disk space in GB
- load_one
–load average over 1 minute
- load_five
–load average over 5 minutes
- load_fifteen
–load average over 15 minutes
Ganglia standard metrics (2/2)
- machine_type
–type of machine (e.g., POWER5)
- mem_total
–total available memory in kB
- mem_free
–amount of free memory in kB
- mem_shared
–amount of shared memory
–not defined on AIX, defined on Linux
- mem_buffers
–amount of memory used for buffers
–not defined on AIX, defined on Linux
- mem_cached
–amount of memory used for cache
–AIX: numperm memory pages
- mtu
–MTU size reported in bytes
- os_name
–name of OS
- os_release
–OS release version (on AIX: level of fileset bos.mp)
- part_max_used
–most filled disk partition
–not defined on AIX, defined on Linux
- pkts_in
–number of network packets received
- pkts_out
–number of network packets sent out
- proc_run
–total number of running processes
- proc_total
–total number of processes
- swap_free
–free swap space in kB
–AIX: paging space free
- swap_total
–total available swap space in kB
–AIX: paging space
New POWER5/6 metrics (AIX & Linux)
Ganglia and POWER5/6
Current deficiencies of Ganglia on POWER5/6:
- Ganglia does not understand Shared Processor LPAR statistics
– things like capped, weight, CPU entitlement etc.
– these metrics can be added (see below)
How to fix these deficiencies, i.e., add these new metrics ?
- Easy solution:
– Extend Ganglia with the utility program gmetric
– Details in section “Extending Ganglia with gmetric” (later)
- Preferred solution:
– Add these new metrics to the gmond implementation on AIX and Linux on POWER
– Requires significant patching of Ganglia source code
– This has been completed and tested, i.e., ready to go !
– Where to get it ?
- My personal web site: http://www.perzl.org/ganglia/
Additional Ganglia POWER5/6 metrics (1/5)
Question:
- How are those additional metrics programmed and where do they get the information from?
Answer:
- AIX:
– Only the APIs provided by libperfstat are used
– As a consequence the fileset bos.perf.libperfstat must be installed
- Linux:
– Only entries in the /proc file system are used, e.g., /proc/cpuinfo, /proc/meminfo, /proc/ppc64/lparcfg etc.
– No additional fileset must be installed
Additional Ganglia POWER5/6 metrics (2/5)
List of 18 additional new metrics for POWER5 (AIX & Linux):
1) capped
2) cpu_entitlement
3) cpu_in_lpar
4) cpu_in_machine
5) cpu_in_pool
6) cpu_pool_idle
7) cpu_used
8) disk_read
9) disk_write
10) kernel64bit
11) lpar
12) lpar_name
13) lpar_num
14) oslevel
15) serial_num
16) smt
17) splpar
18) weight
Additional Ganglia POWER5 metrics (3/5)
1) capped
– Type: String value
– returns "yes" if the system is a POWER5 Shared Processor LPAR which is running in capped mode or "no" otherwise
2) cpu_entitlement
– Type: Float value
– returns the Capacity Entitlement of the system in units of physical CPUs
3) cpu_in_lpar
– Type: Integer value
– returns the number of CPUs the OS sees in the system. In a POWER5 Shared Processor LPAR this returns the number of virtual CPUs. When SMT is enabled this number is doubled.
4) cpu_in_machine
– Type: Integer value
– returns the number of physical CPUs in the whole system
5) cpu_in_pool
– Type: Integer value
– returns the number of physical CPUs in the Shared Processor Pool
6) cpu_pool_idle
– Type: Float value
– returns in fractional numbers of physical CPUs how much the Shared Processor Pool is idle
Additional Ganglia POWER5 metrics (4/5)
1) cpu_used
– Type: Float value
– returns in fractional numbers of physical CPUs how much compute resources this shared processor LPAR has used since the last time this metric was measured
2) disk_read
– Type: Float value
– returns in units of kB the total read I/O of the system
3) disk_write
– Type: Float value
– returns in units of kB the total write I/O of the system
4) kernel64bit
– Type: String value
– returns "yes" if the running kernel is a 64-bit kernel or "no" otherwise
5) lpar
– Type: String value
– returns "yes" if the system is a LPAR or "no" otherwise
6) lpar_name
– Type: String value
– returns the name of the LPAR as defined on the Hardware Management Console (HMC) or some reasonable message otherwise
Additional Ganglia POWER5 metrics (5/5)
1) lpar_num
– Type: Integer value
– returns the partition ID of the LPAR as defined on the Hardware Management Console (HMC) or some reasonable message otherwise
2) oslevel
– Type: String value
– returns the version string as provided by the AIX command 'oslevel'
3) serial_num
– Type: String value
– returns the serial number of the system as provided by the AIX command 'uname'
4) smt
– Type: String value
– returns "yes" if SMT is enabled or "no" otherwise
5) splpar
– Type: String value
– returns "yes" if the system is running in a shared processor LPAR or "no" otherwise
6) weight
– Type: Integer value
– returns the weight of the LPAR running in uncapped mode
Extending Ganglia with gmetric
Extending Ganglia
How can I easily add metrics to Ganglia ?
- Ganglia has a simple way to add metrics that should be monitored
- The utility program gmetric is used for that purpose
- These new metrics are then automatically added to the database and web server data and graphs
[Figure: a new metric is fed in by periodically calling gmetric to provide new data]
gmetric example – Machine firmware level
Example of a static metric: Machine firmware level
gmetric --name firmware \
    --value `lsattr -El sys0 -a fwversion -F value` \
    --type "string"
Remarks:
- The above will only save the statistics once.
- The firmware level is unlikely to change without reboot, therefore it is sufficient to run this command once.
gmetric example – database transactions
Example of a variable metric: Transaction rate of your database
- Assume you have a script called "transactions" that works out the number of transactions and returns it as a number with a decimal point (you will have to write this script yourself !). The metric can then be added with:
gmetric --name tpm \
    --value `/usr/local/bin/transactions` \
    --type double
Remarks:
- This command will only save the statistics once.
- As the number of transactions per minute will definitely change, it is recommended to run the command regularly to keep the value up to date, e.g., once every 60 seconds via cron.
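For anything more involved than a one-line cron entry, the gmetric call can be wrapped in a small script. The following Python sketch is an illustration only: the function names are hypothetical, it assumes gmetric is in the PATH and gmond is already running, and it uses only the --name, --value, --type and --units options of gmetric.

```python
import shutil
import subprocess


def gmetric_argv(name, value, metric_type, units=None):
    """Build the argument list for one gmetric invocation."""
    argv = ["gmetric", "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        argv += ["--units", units]
    return argv


def publish(name, value, metric_type, units=None):
    """Send one metric value; raises if gmetric is missing or exits non-zero."""
    if shutil.which("gmetric") is None:
        raise FileNotFoundError("gmetric not found in PATH")
    subprocess.run(gmetric_argv(name, value, metric_type, units), check=True)


if __name__ == "__main__":
    # Prints the command line instead of running it, so the sketch works anywhere.
    print(" ".join(gmetric_argv("tpm", 1234.5, "double")))
```

A script that calls publish("tpm", value, "double") could then be started from cron once per minute, which matches the 60-second recommendation above.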
Add device specific information to Ganglia
General Remarks
Add device specific information to Ganglia via gmetric for
- Network adapters
- Disks
- Disk adapters (SCSI + Fibre Channel)
Available in two variants:
- as a daemon (implemented in C)
- as a shell script
Network information:
- Daemons:
– g_aix_netif for AIX
– g_linux_netif for Linux on POWER
- Shell script:
– ent_adapter.sh for AIX
Disk information:
- Daemons:
– g_aix_disk for AIX
– g_linux_disk for Linux on POWER
Disk adapter information (AIX only):
- Daemons:
– g_aix_adapter for AIX
- Shell script:
– fcs_adapter.sh for AIX
Add device specific information on AIX and Linux (1/4)
Network adapter information: g_aix_netif or g_linux_netif
- Utility daemon program periodically calls gmetric (interval configurable)
- Monitored parameters per network interface:
– Bytes received / second
– Bytes transmitted / second
– Packets received / second
– Packets transmitted / second
– MTU size (AIX only)
Example:
g_aix_netif -s5 -b3 -p3 -m en1
- Every 5 seconds get the number of bytes/sec and packets/sec transferred in and out as well as the current MTU size for network interface en1.
Add device specific information on AIX and Linux (2/4)
Disk information: g_aix_disk or g_linux_disk
- Utility daemon program periodically calls gmetric (interval configurable)
- Monitored parameters per disk:
– Bytes read / second
– Bytes written / second
Example:
g_aix_disk -s10 -i -o hdisk3
- Every 10 seconds get the number of bytes/sec transferred in and out for disk hdisk3.
Add device specific information on AIX and Linux (3/4)
Disk adapter information: g_aix_adapter (AIX only)
- Utility daemon program periodically calls gmetric (interval configurable)
- Monitored parameters per disk adapter:
– Bytes read / second
– Bytes written / second
Example:
g_aix_adapter -s5 -i -o scsi0
- Every 5 seconds get the number of bytes/sec transferred in and out for SCSI adapter scsi0.
Add device specific information on AIX and Linux (4/4)
./g_aix_netif: Version 1.0
g_aix_netif [OPTIONS] <network-interface>
  [-?] or [-h]      This help information
  [-s seconds]      The time between output (default is 60 seconds),
                    seconds must be in the range [1..3600].
  [-c loop_count]   The number of loops (default = 20 million),
                    loop_count must be in the range [10..20000000].
  [-b 0|1|2|3]      Show bytes received/sent (default = off)
                    0 = don't show, 1 = show incoming
                    2 = show outgoing, 3 = show both
  [-p 0|1|2|3]      Show packets received/sent (default = off)
                    0 = don't show, 1 = show incoming
                    2 = show outgoing, 3 = show both
  [-m]              Show MTU size for <network-interface> (default = off)
  [-d]              Debug mode, remain in foreground, output only to screen
Please note:
  If not in debug mode (-d) g_aix_netif runs the same command via a shell and
  assumes gmetric will be found in the PATH and gmond is already running.
  g_aix_netif will disconnect from your terminal to become a daemon.
  Use "ps -ef | grep g_aix_netif" to confirm it is running.
Ganglia network communication
Ganglia network communication
Ganglia by default uses Multicast
- Some network administrators might not like Multicast
- This can be changed to Unicast, which requires changes to the default config files
– Very simple changes to the gmond.conf files
Ganglia gmond network “chatter”
- The processes talk to each other quite a lot
- The traffic is not large in terms of 100 Mb, 1 Gb or virtual networks
- Recommendation:
– Run Ganglia over the admin network rather than the user network if possible
gmond: Multicast configuration example
gmond.conf:
/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes
  setuid = yes               /* on some Linux distros: no */
  user = nobody
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  host_dmax = 3600           /* secs */
  cleanup_threshold = 300    /* secs */
  gexec = no
}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
   of a <CLUSTER> tag. If you do not specify a cluster tag, then all
   <HOSTS> will NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "System p5 Model 550"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
gmond.conf (continued):
/* The host section describes attributes of the host, like the location */
host {
  location = "System p5 Model 550"
}
/* Feel free to specify as many udp_send_channels as you like. Gmond
   used to only support having a single channel */
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
}
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}
/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}
gmond: Unicast configuration example
gmond.conf:
/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes
  setuid = yes               /* on some Linux distros: no */
  user = nobody
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  host_dmax = 3600           /* secs */
  cleanup_threshold = 300    /* secs */
  gexec = no
}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
   of a <CLUSTER> tag. If you do not specify a cluster tag, then all
   <HOSTS> will NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "System p5 Model 550"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
gmond.conf (continued):
/* The host section describes attributes of the host, like the location */
host {
  location = "System p5 Model 550"
}
/* Feel free to specify as many udp_send_channels as you like. Gmond
   used to only support having a single channel */
udp_send_channel {
  host = p550-aix
  port = 8649
}
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  port = 8649
}
/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}
gmetad: configuration example
# Format:
# data_source "my cluster" [polling interval] address1:port address2:port ...
#
# The keyword 'data_source' must immediately be followed by a unique string which identifies the source,
# then an optional polling interval in seconds. The source will be polled at this interval on average.
# If the polling interval is omitted, 15sec is assumed.
#
# A list of machines which service the data source follows, in the format ip:port, or name:port.
# If a port is not specified then 8649 (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655 1.3.4.8
data_source "System p5 Model 550" 15 p550-aix:8649 p550-nim:8649

# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# RRAs "RRA:AVERAGE:0.5:1:240" "RRA:AVERAGE:0.5:24:240" "RRA:AVERAGE:0.5:168:240" \
#      "RRA:AVERAGE:0.5:672:240" "RRA:AVERAGE:0.5:5760:370"

# The name of this Grid. All the data sources above will be wrapped in a GRID tag with this name.
# default: Unspecified
gridname "Mycluster"
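The data_source rules described in the comments (optional polling interval defaulting to 15 seconds, port defaulting to 8649) can be made concrete with a small parser. This is a sketch for illustration, not gmetad's own parser; the function name is hypothetical and the sketch assumes that a purely numeric first token after the name is the polling interval.

```python
import shlex

DEFAULT_INTERVAL = 15   # seconds, used when no polling interval is given
DEFAULT_PORT = 8649     # default gmond port


def parse_data_source(line):
    """Parse one data_source line into (name, interval, [(host, port), ...])."""
    tokens = shlex.split(line)  # honors the double quotes around the name
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError(f"not a valid data_source line: {line!r}")
    name, rest = tokens[1], tokens[2:]
    interval = DEFAULT_INTERVAL
    if rest[0].isdigit():       # optional polling interval in seconds
        interval = int(rest[0])
        rest = rest[1:]
    if not rest:
        raise ValueError("data_source needs at least one address")
    addresses = []
    for item in rest:
        host, sep, port = item.partition(":")
        addresses.append((host, int(port) if sep else DEFAULT_PORT))
    return name, interval, addresses


if __name__ == "__main__":
    print(parse_data_source(
        'data_source "System p5 Model 550" 15 p550-aix:8649 p550-nim:8649'))
```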
Security issues
- Ganglia setup can be tunneled through SSH if
– certain security guidelines must be adhered to
– security guidelines allow only SSH connections, nothing else
- Detailed description of such a setup can be found at:
– http://www.ibm.com/collaboration/wiki/display/LinuxP/Ganglia/
Installation issues
Build requirements for Ganglia from scratch
- RRDTool package dependencies
– freetype2
– libart_lgpl
– libpng
– Perl
– zlib
- Apache 2 package dependencies
– httpd (requires expat, Perl, zlib)
– mod_ssl (requires httpd, openssl)
- PHP package dependencies
– none
- gmond package dependencies
– none
- gmetad package dependencies
– Apache 2
– PHP
– libxml2
– rrdtool
Installation issues on POWER5/6
Linux:
- Installation on any recent Linux distribution is very easy!
– All Linux distributions contain the necessary RPM packages.
AIX:
- For AIX some (though not all) of the prerequisites can be fulfilled with RPM packages from the AIX Toolbox for Linux Applications:
– http://www.ibm.com/servers/aix/products/aixos/linux/
- People also had problems getting hold of Apache 2 with the latest PHP version on AIX, so Nigel Griffiths wrote a How-To for building these popular Open Source tools for AIX with GCC:
– http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen
- Compilation of Ganglia with the IBM C/C++ compilers is also easy and is done for the Ganglia binaries provided on my personal website:
– http://www.perzl.org/ganglia/
Where to get Ganglia for AIX and Linux on POWER ?
Ganglia binaries and source code for POWER5/6 (1/2)
Where to get it ?
- My personal web site: http://www.perzl.org/ganglia/
What is available ?
- Binary and source RPMs available for:
– AIX v4.3.3 (gmond only)
– AIX 5L v5.1 and v5.2
– AIX 5L v5.3
– Red Hat Enterprise Linux 4 and 5
– SLES 9 and SLES 10
– Fedora Core 4, 5, 6 and 7
– openSUSE 10.0, 10.1, 10.2 and 10.3
- Precompiled Apache 2 + PHP for Ganglia gmetad on AIX 5L v5.1 and higher
- My enhanced Ganglia web interface
- Device specific information (network, disk, adapter) added to Ganglia via gmetric
Ganglia binaries and source code for POWER5/6 (2/2)
- Required source code changes against version 3.0.5 of Ganglia to incorporate the new POWER5/6 metrics:
– configure
– configure.in
– gmetad/Makefile.in
– gmond/gmond.c
– lib/apr_net.c
– lib/libgmond.c
– lib/protocol.x
– libmetrics/libmetrics.h
– libmetrics/aix/metrics.c
– libmetrics/linux/metrics.c
– libmetrics/tests/test-metrics.c
Download statistics of http://perzl.org/ganglia/ (1/2)
Data as of November 1st, 2007
Monthly access of http://www.perzl.org/ganglia/
[Bar chart: number of accesses per month, 08/2006 to 10/2007, y-axis up to 500. Data labels as extracted, in month order: 5, 5, 25, 92, 130, 250, 246, 343, 273, 326, 282, 266, 280, 256, 457]
Download statistics of http://perzl.org/ganglia/ (2/2)
Data as of November 1st, 2007
Monthly download numbers of binary packages
[Chart: monthly downloads from 11/2006 to 10/2007, y-axis up to 160. Series: gmond RPM (all versions), gmetad RPM (all versions), Apache .tar.bz2 (AIX only)]
Best Practices
Some things to consider before you start…
On all nodes:
- A. Hostnames
– To Ganglia a new hostname is a new machine
– Hostnames have to resolve to IP addresses, so use DNS
- B. Stable IP addresses
– Make sure you are not going to change IP addresses
- C. Time and date
– Make sure the timezone, time and date are consistent on all machines in a cluster
– Use of NTP is recommended
So
– These are normal on production machines
– For prototype and test systems: get this right before starting Ganglia
- A simple Ganglia How-To for people setting up their first Ganglia system is available at:
– http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia
Ganglia data flow and what goes where…
[Figure: the data-flow diagram annotated with numbered markers (1, 2, 3). Recoverable labels: gmond with /etc/gmond.conf as one daemon per node/LPAR (shown on three nodes); gmetad with /etc/gmetad.conf, the rrdtool database of stats, and Apache2 + PHP5 with the PHP scripts as only one copy with the web server; browser. Legend: file access, network, web.]
Best Practices
- Preferred setup
- Ganglia sampling intervals
- Ganglia default ports
- Shared Ethernet statistics
- Fibre Channel statistics
- Enhanced web interface
Best Practices – Preferred Setup
- Define each System p machine with all its LPARs as a separate cluster
- Use Unicast for network communication
- Define at least two LPARs per System p machine as gmond hosts for gmetad
– One would be sufficient, however, two is better for high availability reasons
- Define those two LPARs in /etc/gmetad.conf as the information brokers for that machine
- From gmetad: Don’t poll the gmond hosts more frequently than every 15 secs
- Know upfront what time intervals to use for sampling (RRAs stanza in /etc/gmetad.conf)
– See next slides
- Use my extensions for
– Ethernet adapters
– Fibre Channel adapters
– Web interface
Best Practices – Ganglia Sampling Intervals (1/6)
Important to know:
- The sampling interval is defined in /etc/gmetad.conf.
- The "RRAs" stanza is used to define individual settings.
- The sampling settings are global.
- If no "RRAs" stanza is defined a default configuration is used.
- For historic reasons all values are specified in intervals of 15 seconds.
Best Practices – Ganglia Sampling Intervals (2/6)
Example: Default settings in Ganglia
- RRAs "RRA:AVERAGE:0.5:1:240" \
"RRA:AVERAGE:0.5:24:240" \
"RRA:AVERAGE:0.5:168:240" \
"RRA:AVERAGE:0.5:672:240" \
"RRA:AVERAGE:0.5:5760:370"
Translation (in parentheses: the view each archive is used to display):
- Take 240 samples at 1 × 15 seconds intervals (hour)
- Take 240 samples at 24 × 15 seconds (= 6 minutes) intervals (day)
- Take 240 samples at 168 × 15 seconds (= 42 minutes) intervals (week)
- Take 240 samples at 672 × 15 seconds (= 168 minutes) intervals (month)
- Take 370 samples at 5760 × 15 seconds (= 24 hours) intervals (year)
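The translation above is plain arithmetic on the last two fields of each RRA definition. A minimal sketch, assuming the 15-second base interval stated earlier (the function name is made up for this example):

```python
BASE_STEP_SECS = 15  # base interval in which all values are specified


def describe_rra(spec):
    """Return (interval_secs, rows, covered_secs) for 'RRA:CF:xff:steps:rows'."""
    tag, _cf, _xff, steps, rows = spec.split(":")
    if tag != "RRA":
        raise ValueError(f"not an RRA definition: {spec}")
    interval = int(steps) * BASE_STEP_SECS
    return interval, int(rows), interval * int(rows)


defaults = [
    "RRA:AVERAGE:0.5:1:240",
    "RRA:AVERAGE:0.5:24:240",
    "RRA:AVERAGE:0.5:168:240",
    "RRA:AVERAGE:0.5:672:240",
    "RRA:AVERAGE:0.5:5760:370",
]

for spec in defaults:
    interval, rows, covered = describe_rra(spec)
    print(f"{spec}: one sample per {interval} s, {rows} samples, "
          f"covers {covered / 3600:g} hours")
```

The covered time span explains the view names: 240 samples at 6-minute intervals cover 24 hours, 240 samples at 42-minute intervals cover 168 hours (one week), and so on.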
Best Practices – Ganglia Sampling Intervals (3/6)
Example: 1-minute sampling for one year
- RRAs "RRA:AVERAGE:0.5:4:525600"
Translation:
- Take 525600 samples at 4 × 15 seconds (= 1 minute) intervals
- 525600 = 60 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)
Best Practices – Ganglia Sampling Intervals (4/6)
Example: 1-minute sampling for 6 months, 5-minute sampling for 2 years
- RRAs "RRA:AVERAGE:0.5:4:259200" \
"RRA:AVERAGE:0.5:20:210240"
Translation:
- Take 259200 samples at every 4 × 15 seconds (= 1 minute) intervals
– 259200 = 60 (samples/hour) × 24 (hours) × 30 (days) × 6 (months)
- Take 210240 samples at every 20 × 15 seconds (= 5 minutes) intervals
– 210240 = 12 (samples/hour) × 24 (hours) × 365 (days) × 2 (years)
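Going the other way, from a wanted sampling interval and retention period to the RRAs stanza, can be sketched as follows. The helper names are hypothetical, and months and years are counted as 30 and 365 days as in the example above.

```python
BASE_STEP_SECS = 15  # base interval in which all values are specified


def rra(interval_secs, retention_days):
    """Build one RRA definition for the given interval and retention."""
    if interval_secs % BASE_STEP_SECS:
        raise ValueError("interval must be a multiple of 15 seconds")
    steps = interval_secs // BASE_STEP_SECS
    rows = retention_days * 86400 // interval_secs
    return f"RRA:AVERAGE:0.5:{steps}:{rows}"


def rras_stanza(plan):
    """Join several (interval_secs, retention_days) pairs into one stanza."""
    quoted = [f'"{rra(interval, days)}"' for interval, days in plan]
    return "RRAs " + " \\\n     ".join(quoted)


# 1-minute sampling for 6 months (180 days),
# 5-minute sampling for 2 years (730 days)
plan = [(60, 180), (300, 730)]
print(rras_stanza(plan))
```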
Best Practices – Ganglia Sampling Intervals (5/6)
Example: 15-second sampling for 1 day, 1-minute sampling for 2 months, 10-minute sampling for 1 year
- RRAs "RRA:AVERAGE:0.5:1:5760" \
"RRA:AVERAGE:0.5:4:86400" \
"RRA:AVERAGE:0.5:40:52560"
Translation:
- Take 5760 samples at every 1 × 15 seconds intervals
– 5760 = 4 (samples/minute) × 60 (minutes/hour) × 24 (hours)
- Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals
– 86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months)
- Take 52560 samples at every 40 × 15 seconds (= 10 minutes) intervals
– 52560 = 6 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)
Best Practices – Ganglia Sampling Intervals (6/6)
Example: 1-minute sampling for 2 months, 5-minute sampling for 6 months, 15-minute sampling for 3 years
- RRAs "RRA:AVERAGE:0.5:4:86400" \
"RRA:AVERAGE:0.5:20:51840" \
"RRA:AVERAGE:0.5:60:105120"
Translation:
- Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals
– 86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months)
- Take 51840 samples at every 20 × 15 seconds (= 5 minutes) intervals
– 51840 = 12 (samples/hour) × 24 (hours) × 30 (days) × 6 (months)
- Take 105120 samples at every 60 × 15 seconds (= 15 minutes) intervals
– 105120 = 4 (samples/hour) × 24 (hours) × 365 (days) × 3 (years)
Best Practices – Default Ports
Ganglia by default uses the following ports:
- 8649
The port gmond uses for
– Sending to other gmonds via UDP (udp_send_channel in /etc/gmond.conf)
– Receiving from other gmonds via UDP (udp_receive_channel in /etc/gmond.conf)
– Sending an XML description of the state of the cluster (tcp_accept_channel in /etc/gmond.conf)
- 8651
The port on which gmetad answers requests for XML.
- 8652
The port on which gmetad answers queries for XML. This facility allows simple subtree and summation views of the XML tree.
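A quick way to check these ports is to connect and read the XML. The sketch below assumes that the daemon writes its complete XML description as soon as a TCP client connects and then closes the connection; the sample document is a hand-written illustration of the element layout, not real output.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host, port=8649, timeout=5.0):
    """Read everything the daemon sends until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def metrics_by_host(xml_data):
    """Map each host name to a dict of metric name -> value (as strings)."""
    root = ET.fromstring(xml_data)
    return {
        host.get("NAME"): {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        for host in root.iter("HOST")
    }


# Hand-written sample for illustration only
SAMPLE = """<GANGLIA_XML VERSION="3.0.5" SOURCE="gmond">
<CLUSTER NAME="mycluster">
<HOST NAME="lpar1" IP="10.0.0.1">
<METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=""/>
</HOST>
</CLUSTER>
</GANGLIA_XML>"""

print(metrics_by_host(SAMPLE))
```

Against a live system, metrics_by_host(fetch_xml("localhost", 8649)) would query the local gmond, and port 8651 the gmetad.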
72
Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Shared Ethernet Statistics
Question: How to monitor SEA statistics on the VIO server ?
- The AIX libperfstat library seems not to report any statistics about Ethernet
adapters if there are no interfaces defined on that adapter.
- Interfaces are only seldom defined on SEAs.
- The AIX command ‘entstat’ however provides these statistics.
Solution: Extension through gmetric via a shell script
- Korn shell script ‘ent_adapter.sh’
- Get it from http://www.perzl.org/ganglia/devicespecific.html
- Graphs will appear immediately for that specific host
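The idea behind such an extension can be sketched as follows. This is not the ‘ent_adapter.sh’ script itself: the sample text only imitates the two-column layout of ‘entstat’ output, the metric names are made up, and the gmetric call uses the --name, --value, --type and --units options.

```python
import re


def parse_byte_counters(entstat_text):
    """Return (tx_bytes, rx_bytes) from the first 'Bytes: N   Bytes: M' line."""
    match = re.search(r"^Bytes:\s*(\d+)\s+Bytes:\s*(\d+)", entstat_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Bytes counters found")
    return int(match.group(1)), int(match.group(2))


def rate_per_sec(previous, current, elapsed_secs):
    """Turn two readings of a cumulative counter into a rate."""
    return (current - previous) / elapsed_secs


def gmetric_command(name, value, units, value_type="double"):
    """Build the gmetric command line as an argument list."""
    return ["gmetric", "--name", name, "--value", str(value),
            "--type", value_type, "--units", units]


# Two illustrative samples taken 60 seconds apart
SAMPLE_T0 = """Transmit Statistics:              Receive Statistics:
Packets: 100                      Packets: 200
Bytes: 1000                       Bytes: 5000
"""
SAMPLE_T1 = """Transmit Statistics:              Receive Statistics:
Packets: 400                      Packets: 900
Bytes: 31000                      Bytes: 65000
"""

tx0, rx0 = parse_byte_counters(SAMPLE_T0)
tx1, rx1 = parse_byte_counters(SAMPLE_T1)
commands = [
    gmetric_command("ent0_bytes_out", rate_per_sec(tx0, tx1, 60), "bytes/sec"),
    gmetric_command("ent0_bytes_in", rate_per_sec(rx0, rx1, 60), "bytes/sec"),
]
for command in commands:
    print(" ".join(command))
```

On a real VIO server each command would be run, for example with subprocess.run(command, check=True), once per sampling interval.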
Best Practices – Fibre Channel Statistics
Question: How to monitor Fibre Channel statistics on the VIO server ?
- The AIX libperfstat library seems not to report any statistics about Fibre
Channel adapters if there are no disks attached to the adapter.
- Tapes, for instance, would be left out.
- The AIX command ‘fcstat’ however provides these statistics.
Solution: Extension through gmetric via a shell script
- Korn shell script ‘fcs_adapter.sh’
- Get it from http://www.perzl.org/ganglia/devicespecific.html
- Graphs will appear immediately for that specific host
Best Practices – Enhanced Web Interface (1/2)
- Get it from:
http://www.perzl.org/ganglia/webinterface.html
Includes the following extensions:
- Jscalendar patch of Timothy D. Witham
for selecting custom intervals for graph display.
- Custom Graph Patch of Alex Balk.
- Zoomable graphs
– Adapted from the UC Berkeley Grid live demo website.
– They implemented some nice graph zooming, i.e., when you click on a single node metric graph or on one of the Grid overview graphs, you get a nicely scaled, bigger version of that graph.
Best Practices – Enhanced Web Interface (2/2)
Includes the following extensions (cont.)
- cpu_used statistics for IBM POWER5/6 systems
– A cpu_used overview graph for the cluster_view template which, when clicked on, produces a detailed cpu_used statistics graph for the whole cluster
Future additions / plans
- Try to incorporate the POWER5/6 additions into the Ganglia mainstream
source code
- Adapt the AIX and Linux on POWER metrics to POWER6
- Provide a custom web interface tailored specifically to POWER5/6
- Future updates on my personal web site:
– Update RPMs to newest versions
– Provide RPMs for Apache + PHP on my AIX Open Source web site (very soon)
Discussion
Ganglia advantages
- Overview and major statistics available at one look with nice graphs
- Available on many platforms
- Widely used, many users and many different setups
- Global view (cluster/grid) and local view (single node) available
- Easily remote accessible through web interface
- Highly customizable through config files
- Shared Processor LPAR statistics available
- Monitored data is stored in Round-Robin Databases (RRDs), i.e., this information could
easily be passed on to accounting, too
- Fine granular statistics possible
- Different time interval views: hour, day, week, month, year or customizable
- Open Source (source code and binary RPMs for AIX and Linux on POWER) available
- Low risk and low cost
- Easily extendible
– through the utility program gmetric
– through adapting the Ganglia source code (as shown in this presentation)
Ganglia disadvantages
- Not an official IBM tool
- No official support available
- Primarily a monitoring tool, not an accounting tool
- "Only" a monitoring (visualization) tool, no actions can be triggered
- AIX setup from scratch requires some work (building all prerequisite software),
although it is well documented
– Normally not necessary when using my binary RPMs
Links
Links (1/2)
- Main Ganglia website
– http://ganglia.info/
- Ganglia Documentation
– http://ganglia.info/docs/
- Ganglia Source Code Download
– http://ganglia.sourceforge.net/downloads.php
- Ganglia POWER5/6 extensions and ready-to-run binaries (RPM files) as well
as source code
– http://www.perzl.org/ganglia/
- My personal AIX Open Source repository
– http://www.perzl.org/aix/
Links (2/2)
- Ganglia Usage at Wikipedia
– http://ganglia.wikimedia.org/
- RRDTool homepage
– http://oss.oetiker.ch/rrdtool/
- Ganglia How-To on IBM AIX wiki site (written by Nigel Griffiths)
– http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia
- Open Source with AIX on IBM AIX wiki site (written by Nigel Griffiths)
– http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen
- IBM AIX wiki site:
– http://www.ibm.com/collaboration/wiki/display/WikiPtype/Home
- IBM Linux on POWER wiki site:
– http://www.ibm.com/collaboration/wiki/display/LinuxP/Home
Thank you for your attention !
Questions ?
Backup Slides
Simple setup example
1 Simple gmond install (1/2)
On all data nodes:
- Install the gmond RPM file on each data node
– rpm -Uvh ganglia-gmond-VVV.PPP.rpm
– VVV.PPP is the version number and platform, like:
  - 3.0.5
  - aix5.1.ppc, aix5.3.ppc, suse.ppc64, redhat.ppc64
- Edit the configuration file
– /etc/gmond.conf
- On Linux on POWER
– gmond needs access to /proc/ppc64/lparcfg, which is readable by the root user only
– So also set "setuid = no", depending on the Linux distribution
- Set the cluster name: change the default stanza
cluster {
  name = "unspecified"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
to
cluster {
  name = "mycluster"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
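When many LPARs are installed, the cluster name edit can be scripted. A minimal sketch, assuming the cluster stanza looks like the default one shown on this slide (the function name is hypothetical):

```python
import re

# Matches everything up to the opening quote of the name value inside the
# first cluster { ... } stanza, then the old value, then the closing quote.
CLUSTER_NAME = re.compile(r'(\bcluster\s*\{[^}]*?\bname\s*=\s*")[^"]*(")')


def set_cluster_name(conf_text, new_name):
    """Return conf_text with the name in the first cluster stanza replaced."""
    updated, count = CLUSTER_NAME.subn(
        lambda m: m.group(1) + new_name + m.group(2), conf_text, count=1
    )
    if count == 0:
        raise ValueError("no cluster stanza with a name entry found")
    return updated


DEFAULT_STANZA = """cluster {
  name = "unspecified"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
"""

updated = set_cluster_name(DEFAULT_STANZA, "mycluster")
print(updated)
```

In practice the function would be applied to the contents of /etc/gmond.conf and the result written back before starting gmond.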
1 Simple gmond install (2/2)
- Use the gmond control script located in /etc to start gmond:
– /etc/init.d/gmond start (Linux: SUSE and Red Hat)
– /etc/rc.d/init.d/gmond start (AIX)
- These scripts also automatically start gmond when booting the system
- All options are:
– start
– stop
– restart
– status
- Easy to automate the install: just a couple of files, and the gmond.conf is the same on all nodes/LPARs
On all data nodes
2 Simple gmetad - prerequisites (1/3)
- Gmetad needs rrdtool to store the data and Apache2 + PHP to serve the web
pages
- On Linux, rrdtool, Apache2 and PHP5 are part of every distribution
- For AIX you can resolve the prerequisites as follows:
– rrdtool: provided at http://www.perzl.org/ganglia/
– libart_lgpl: AIX Toolbox for Linux Applications
– libpng: AIX Toolbox for Linux Applications
– freetype2: AIX Toolbox for Linux Applications
– zlib: AIX Toolbox for Linux Applications
– Perl: AIX Toolbox for Linux Applications
- AIX Toolbox for Linux Applications
– http://www.ibm.com/servers/aix/products/aixos/linux
2 Simple gmetad install (2/3)
- Install the gmetad RPM file on the central node
– rpm -Uvh ganglia-gmetad-VVV.PPP.rpm
– VVV.PPP is the version number and platform, like:
  - 3.0.5 and
  - aix5.1.ppc, aix5.3.ppc, suse.ppc64, redhat.ppc64
- Edit the configuration file /etc/gmetad.conf
– data_source "mycluster" localhost
– "mycluster" is the cluster name; localhost means the local gmond supplies the Ganglia data
On the central node
2 Start gmetad (3/3)
- Use the gmetad control script located in /etc to start gmetad:
– /etc/init.d/gmetad start (Linux: SUSE and Red Hat)
– /etc/rc.d/init.d/gmetad start (AIX)
- These scripts also automatically start gmetad when booting the system
- All options are:
– start
– stop
– restart
– status
On the central node
3 Ganglia Web Server front end setup (1/2)
- Could use existing web server but we need PHP support
– On AIX, we have only used Apache2 and PHP5 (see next slide)
– On Linux, use the version included with the distribution (Red Hat EL 4 & 5, SUSE SLES 9 & SLES 10)
- Simple test of whether PHP works:
– Create <web-server-directory>/phptest.php with this content:
  <h1>PHP Test</h1>
  <?php phpinfo(); ?>
– Use a browser to access this file
– It should print out lots of interesting data
- rpm -Uvh ganglia-web-3.0.5-1.noarch.rpm
– "noarch" as this is PHP scripts only
– may have to use the "--ignoreos" flag for installation
– may have to move these files to your web server directory tree, depending on whether you are running AIX or Linux
On the central node
- For AIX you will not be able to find Apache2 and PHP5 with the required
features:
– AIX CD-ROM or update download site – nope
– AIX repositories (Bull or UCLA) – nope
– AIX Toolbox for Linux Applications – nope
– WAS – nope
– IBM HTTP Server (Apache with add-ons) – nope
- Get Apache + PHP from my personal website instead!
– http://www.perzl.org/ganglia/
– http://www.perzl.org/aix/
- If you have to recompile your own Apache + PHP
– Using the latest GNU GCC compiler and the latest libraries, it is actually easy
– Nigel Griffiths wrote up the details of how to do this on the AIX 5L Wiki at
  http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen
On the central node