

SLIDE 1

Isolated MPI-I/O Solution on top of MPI-1

Emin Gabrielyan, Roger D. Hersch
École Polytechnique Fédérale de Lausanne, Switzerland
{Emin.Gabrielyan,RD.Hersch}@epfl.ch

5th Workshop on Distributed Supercomputing:

Scalable Cluster Software

May 23-24, 2001, Sheraton Hyannis, Cape Cod, Hyannis MA

SLIDE 2

The figure shows the data access operations arranged along the three axes: positioning (explicit offsets, individual file pointers, shared file pointers), synchronism (blocking, non-blocking / split collective) and coordination (non-collective, collective):

  • explicit offsets, non-collective: READ_AT / WRITE_AT (blocking), IREAD_AT / IWRITE_AT (non-blocking)
  • explicit offsets, collective: READ_AT_ALL / WRITE_AT_ALL (blocking), READ_AT_ALL_BEGIN/END / WRITE_AT_ALL_BEGIN/END (split collective)
  • individual file pointers, non-collective: READ / WRITE (blocking), IREAD / IWRITE (non-blocking)
  • individual file pointers, collective: READ_ALL / WRITE_ALL (blocking), READ_ALL_BEGIN/END / WRITE_ALL_BEGIN/END (split collective)
  • shared file pointers, non-collective: READ_SHARED / WRITE_SHARED (blocking), IREAD_SHARED / IWRITE_SHARED (non-blocking)
  • shared file pointers, collective: READ_ORDERED / WRITE_ORDERED (blocking), READ_ORDERED_BEGIN/END / WRITE_ORDERED_BEGIN/END (split collective)

MPI-I/O Access Operations

The basic set of MPI-I/O interface functions consists of File Manipulation Operations, File View Operations and Data Access Operations. Data access has three orthogonal aspects: positioning, synchronism and coordination; combining them yields 12 types of read operations and 12 types of write operations.
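To make the positioning aspect concrete, here is a minimal sketch (not taken from the slides) that reads the same 1 KB once with each of the three positioning methods, all blocking and non-collective. It assumes a full MPI-I/O library and an existing file; the file name data.bin and the buffer size are illustrative.

/* positioning.c -- illustrative sketch of the three positioning methods
 * for blocking, non-collective reads.  File name and buffer size are
 * arbitrary; any MPI-I/O library provides these calls. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Status st;
    char buf[1024];

    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "data.bin", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* explicit offset: the position is part of the call itself */
    MPI_File_read_at(fh, 0, buf, sizeof buf, MPI_BYTE, &st);

    /* individual file pointer: each process keeps its own position */
    MPI_File_seek(fh, 0, MPI_SEEK_SET);
    MPI_File_read(fh, buf, sizeof buf, MPI_BYTE, &st);

    /* shared file pointer: all processes advance one common position */
    MPI_File_read_shared(fh, buf, sizeof buf, MPI_BYTE, &st);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}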

SLIDE 3

The figure shows the four combinations of fragmentation in memory and in the file:

  • contiguous in memory as well as in file
  • non-contiguous in memory, contiguous in file
  • contiguous in memory, non-contiguous in file
  • non-contiguous in memory as well as in file

The file view is a global concept which affects all data access operations. For each process it specifies its own view of the shared data file: a sequence of pieces of the common data file that are visible to that particular process. In order to specify the file view, the user creates a derived datatype which defines the fragmented structure of the visible part of the file. Since each access operation can use another derived datatype that specifies the fragmentation in memory, there are two additional orthogonal aspects to data access: the fragmentation in memory and the fragmentation of the file view (a sketch of setting a per-process file view is given below).


File View
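As an illustration of the file view concept, the following sketch (not part of the slides) gives each of p processes a view exposing every p-th stripe of a shared file, so that a buffer which is contiguous in memory ends up non-contiguous in the file. The file name view.dat, the 4-byte stripe unit and the choice of MPI_Type_vector are illustrative; the calls themselves are the standard MPI-I/O interface.

/* fileview.c -- sketch of a per-process file view: with p processes and a
 * 4-byte stripe unit, process r sees stripes r, r+p, r+2p, ... of the
 * shared file.  File name and sizes are illustrative. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int blocklen = 4, nblocks = 8;
    int rank, nprocs;
    MPI_File fh;
    MPI_Datatype filetype;
    MPI_Status st;
    char buf[32];                      /* nblocks * blocklen bytes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* filetype: nblocks blocks of blocklen bytes, strided by nprocs stripes */
    MPI_Type_vector(nblocks, blocklen, blocklen * nprocs, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    memset(buf, 'A' + rank, sizeof buf);   /* data contiguous in memory */

    MPI_File_open(MPI_COMM_WORLD, "view.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* the displacement rank*blocklen shifts each process onto its own stripes */
    MPI_File_set_view(fh, (MPI_Offset)rank * blocklen, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);

    /* contiguous in memory, non-contiguous in the file */
    MPI_File_write(fh, buf, sizeof buf, MPI_BYTE, &st);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}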

SLIDE 4

The figure shows the construction of the derived datatype T4:

MPI_Type_vector (3,1,2,MPI_BYTE,&T1)
MPI_Type_contiguous(2,T1,&T2)
MPI_Type_struct(2,...,&T3)
MPI_Type_contiguous(2,T3,&T4)

MPI-1 provides techniques for creating datatype objects with an arbitrary data layout in memory. The opaque datatype object can be used in various MPI operations, but the layout information, once put into a derived datatype, cannot be decoded back from the datatype (a sketch follows below).

Derived Datatypes
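As a concrete illustration (a sketch, not the original code), the following program builds the first two constructors of the slide's chain and queries them; standard MPI-1 reports only the size and the extent of a derived datatype, which is exactly why the layout itself cannot be read back. The MPI_Type_struct arguments for T3 are elided on the slide, so the sketch stops at T2.

/* datatypes.c -- build the first constructors of the slide's T4 chain and
 * query them: MPI-1 reports only size and extent, the byte layout cannot
 * be recovered from the opaque datatype objects. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Datatype t1, t2;
    int size;
    MPI_Aint extent;

    MPI_Init(&argc, &argv);

    /* T1: 3 blocks of 1 byte with a stride of 2 bytes -> offsets 0, 2, 4 */
    MPI_Type_vector(3, 1, 2, MPI_BYTE, &t1);
    /* T2: two consecutive copies of T1 */
    MPI_Type_contiguous(2, t1, &t2);
    MPI_Type_commit(&t2);
    /* the slide continues with MPI_Type_struct(2,...,&T3) and
     * MPI_Type_contiguous(2,T3,&T4); the struct arguments are not given
     * on the slide, so they are omitted here as well. */

    MPI_Type_size(t2, &size);        /* 6 significant bytes           */
    MPI_Type_extent(t2, &extent);    /* 10-byte span (MPI-1 call)     */
    printf("T2: size = %d bytes, extent = %ld bytes\n", size, (long)extent);

    MPI_Type_free(&t2);
    MPI_Type_free(&t1);
    MPI_Finalize();
    return 0;
}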

SLIDE 5

MPI-I/O Implementation

The figure shows the software layers: the MPI-I/O implementation, exposing the MPI-I/O interface, is built on top of the MPI-1 interface and the MPI-1 implementation, and needs access to the internal operations and data structures of the MPI-1 implementation in order to decode the layout information of the file view's derived datatype.

MPI-2 operations, and the MPI-I/O subset in particular, form an extension to MPI-1. However, a developer of MPI-I/O needs access to the source code of the MPI-1 implementation on top of which he intends to implement MPI-I/O. A specific development of MPI-I/O is therefore required for each MPI-1 implementation.

SLIDE 6

Reverse Engineering or Memory Painting

The layout information cannot be decoded from the datatype, but the behaviour of the datatype depends on the layout. We therefore define a special test for a derived datatype, analyse the behaviour of the datatype and, based on it, decode the layout information of the datatype. For example, the MPI_Recv operation receives a contiguous network stream and distributes it in memory according to the data layout of the datatype. If the memory is previously initialised with a “green colour” and the network stream carries a “red colour”, then analysing the memory after data reception gives us the necessary information on the data layout hidden in the opaque datatype. In our solution we do not use the MPI_Send and MPI_Recv operations; instead we use the standard MPI-1 operation MPI_Unpack, which avoids network transfers and the use of multiple processes. A minimal sketch of this test follows below.

The figure illustrates the test: a contiguous buffer of the size of the datatype T4 is transferred into a buffer of the size of T4's extent, using MPI_Send(source,size,MPI_BYTE,...) and MPI_Recv(destination-LB,1,T4,...).
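Below is a minimal sketch of such a memory painting test, assuming a single process and only standard MPI-1 calls; the helper name paint_layout and the colour values are illustrative, not the actual library code. It packs a contiguous buffer of "red" bytes, unpacks it through the opaque datatype into a "green" buffer, and then scans the buffer to recover the layout.

/* painting.c -- minimal sketch of memory painting with standard MPI-1
 * calls only (MPI_Pack / MPI_Unpack): no network transfer, one process.
 * Names and colour values are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the byte offsets, relative to the lower bound, that `dtype`
 * touches.  Assumes a committed datatype built from MPI_BYTE, like the
 * T1..T4 example of the previous slides. */
static void paint_layout(MPI_Datatype dtype, MPI_Comm comm)
{
    int size, packsize, pos = 0;
    MPI_Aint lb, extent, i;
    unsigned char *red, *packed, *green;

    MPI_Type_size(dtype, &size);      /* number of significant bytes  */
    MPI_Type_lb(dtype, &lb);          /* lower bound (MPI-1 call)     */
    MPI_Type_extent(dtype, &extent);  /* memory span of the datatype  */
    MPI_Pack_size(size, MPI_BYTE, comm, &packsize);

    red    = malloc(size);            /* contiguous "red" source      */
    packed = malloc(packsize);        /* packing buffer               */
    green  = malloc(extent);          /* "green" destination          */
    memset(red,   0xFF, size);
    memset(green, 0x00, extent);

    /* Pack the red bytes, then unpack them through the opaque datatype:
     * MPI_Unpack scatters the contiguous stream into the green buffer
     * exactly according to the hidden layout (cf. destination-LB in the
     * figure above). */
    MPI_Pack(red, size, MPI_BYTE, packed, packsize, &pos, comm);
    pos = 0;
    MPI_Unpack(packed, packsize, &pos, green - lb, 1, dtype, comm);

    /* every byte that turned red reveals one offset of the layout */
    for (i = 0; i < extent; i++)
        if (green[i] == 0xFF)
            printf("offset %ld belongs to the datatype\n", (long)(i + lb));

    free(red); free(packed); free(green);
}

int main(int argc, char **argv)
{
    MPI_Datatype t1;

    MPI_Init(&argc, &argv);
    MPI_Type_vector(3, 1, 2, MPI_BYTE, &t1);  /* T1 of slide 4 */
    MPI_Type_commit(&t1);

    paint_layout(t1, MPI_COMM_SELF);          /* expect offsets 0, 2, 4 */

    MPI_Type_free(&t1);
    MPI_Finalize();
    return 0;
}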

SLIDE 7

The figure shows the layering of the portable solution: MPI-I/O Interface, MPI-I/O Implementation, Memory Painting, MPI-1 Interface, MPI-1 Implementation.

Portable MPI-I/O Solution

Once we have a tool for decoding derived datatypes, it becomes possible to create an isolated MPI-I/O solution on top of any standard MPI-1. The Argonne National Laboratory's MPICH implementation of MPI-I/O is used intensively together with our datatype decoding technique, and an isolated solution of a limited subset of the MPI-I/O operations has been implemented.

SLIDE 8

The figure repeats the table of data access operations from slide 2.

MPI-I/O Isolation

The basic File Manipulation operations MPI_File_open and MPI_File_close, the File View operation MPI_File_set_view, and the blocking non-collective Data Access operations MPI_File_write, MPI_File_write_at, MPI_File_read and MPI_File_read_at have already been successfully implemented in the form of an isolated, independent library. We are currently working on the collective counterparts of the blocking operations and are trying to make use of the extended two-phase method for accessing sections of out-of-core arrays, on which the ANL implementation is based.
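A minimal usage sketch of the implemented subset is shown below (not taken from the slides); the file name and record size are illustrative, and the calls use the standard MPI-I/O signatures that the isolated library exposes.

/* subset.c -- sketch of the implemented subset: open, set_view, blocking
 * non-collective write/read with explicit offsets, close.  File name and
 * record size are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Status st;
    int rank;
    char out[64], in[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "isolated.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* trivial view: the whole file, byte by byte */
    MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

    /* blocking, non-collective access with explicit offsets:
     * each process writes and reads back its own 64-byte record */
    memset(out, 'A' + rank, sizeof out);
    MPI_File_write_at(fh, (MPI_Offset)rank * sizeof out, out,
                      sizeof out, MPI_BYTE, &st);
    MPI_File_read_at(fh, (MPI_Offset)rank * sizeof out, in,
                     sizeof in, MPI_BYTE, &st);

    printf("rank %d: read back %s\n", rank,
           memcmp(in, out, sizeof out) == 0 ? "ok" : "MISMATCH");

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}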

SLIDE 9

The figure shows the software stack used for testing: MPI-I/O Interface, MPI-I/O Implementation, Memory Painting, MPI-FCI on the Swiss-Tx.

Testing Isolated MPI-I/O

Contiguous memory and file

  • MPI_File_write: MPI-FCI Ok
  • MPI_File_read: MPI-FCI Ok
  • MPI_File_write_at: MPI-FCI Ok
  • MPI_File_read_at: MPI-FCI Ok

Fragmented memory, contiguous file

  • MPI_File_write: MPI-FCI Ok
  • MPI_File_read: MPI-FCI Ok
  • MPI_File_write_at: MPI-FCI Ok
  • MPI_File_read_at: MPI-FCI Ok

Contiguous memory, fragmented file

  • MPI_File_write: MPI-FCI Ok
  • MPI_File_read: MPI-FCI Ok
  • MPI_File_write_at: MPI-FCI Ok
  • MPI_File_read_at: MPI-FCI Ok

Fragmented memory and file

  • MPI_File_write: MPI-FCI Ok
  • MPI_File_read: MPI-FCI Ok
  • MPI_File_write_at: MPI-FCI Ok
  • MPI_File_read_at: MPI-FCI Ok

The implemented operations of the isolated MPI-I/O solution have been successfully tested with the MPI-FCI implementation of MPI-1 on the Swiss-Tx supercomputer.

SLIDE 10

The figure shows the Swiss-T1 machine: 64 processors (PR00 to PR63) attached to TNET switches, with TNET connections of about 86 MB/s and routing between the switches; the legend distinguishes I/O processors, compute processors and switches.

Gateway to the Parallel I/O of the Swiss-T1

Underneath the isolated MPI-I/O we intend to provide, as a high-performance I/O solution, a switch to the Striped File I/O system (SFIO). The SFIO communication layer is implemented on top of MPI-1, and SFIO is therefore also portable. We measured scalable SFIO performance on the architecture of the Swiss-Tx supercomputer.

SLIDE 11

The figure plots SFIO performance in MB/s (up to about 400 MB/s) against the number of compute or I/O nodes (1 to 31), with curves for read average, read maximum, write average and write maximum.

SFIO on the Swiss-Tx machine

The performance of SFIO is measured for concurrent access from all compute nodes to all I/O nodes. In order to limit operating system caching effects, the total size of the striped file increases linearly with the number of I/O nodes, up to 32 GB. The stripe unit size is 200 bytes. The application's I/O performance is measured as a function of the number of compute and I/O nodes.

SLIDE 12

Conclusion

The isolated solution automatically gives every MPI-1 owner an MPI-I/O, without any need to change, modify, or otherwise interfere with his current MPI-1 implementation.

Future work

  • Implementation of blocking collective file access operations.
  • Implementation of non-blocking file access operations.
  • The remaining File Manipulation Operations.
  • Switching to SFIO.

Thank You!