Using TCP Through Sockets

David Mazières
dm@amsterdam.lcs.mit.edu

1 File descriptors

Most I/O on Unix systems takes place through the read and write system calls [1]. Before discussing network I/O, it helps to understand how these functions work even on simple files. If you are already familiar with file descriptors and the read and write system calls, you can skip to the next section.

Section 1.1 shows a very simple program that prints the contents of files to the standard output, just like the UNIX cat command. The function typefile uses four system calls to copy the contents of a file to the standard output.

  • int open (char *path, int flags, ...);

    The open system call requests access to a particular file. path specifies the name of the file to access; flags determines the type of access being requested, in this case read-only access. open ensures that the named file exists (or can be created, depending on flags) and checks that the invoking user has sufficient permission for the mode of access.

    If successful, open returns a non-negative integer known as a file descriptor. All read and write operations must be performed on file descriptors. File descriptors remain bound to files even when files are renamed or deleted or undergo permission changes that revoke access [2]. By convention, file descriptor numbers 0, 1, and 2 correspond to standard input, standard output, and standard error respectively. Thus a call to printf will result in a write to file descriptor 1.

    If unsuccessful, open returns -1 and sets the global variable errno to indicate the nature of the error. The routine perror will print "filename: error message" to the standard error based on errno.

  • int read (int fd, void *buf, int nbytes);

    read will read up to nbytes bytes of data into memory starting at buf. It returns the number of bytes actually read, which may very well be less than nbytes. If it returns 0, this indicates an end of file. If it returns -1, this indicates an error.

[1] High-level I/O functions such as fread and fprintf are implemented in terms of read and write.
[2] Note that not all network file systems properly implement these semantics.

  • int write (int fd, void *buf, int nbytes);

    write will write up to nbytes bytes of data at buf to file descriptor fd. It returns the number of bytes actually written, which unfortunately may be less than nbytes in some circumstances. write returns 0 to indicate an end of file, and -1 to indicate an error.

  • int close (int fd);

    close deallocates a file descriptor. Systems typically limit each process to 64 file descriptors by default (though the limit can sometimes be raised substantially with the setrlimit system call). Thus, it is a good idea to close file descriptors after their last use so as to prevent "too many open files" errors.

1.1 type.c: Copy file to standard output

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>

    void
    typefile (char *filename)
    {
      int fd, nread;
      char buf[1024];

      fd = open (filename, O_RDONLY);
      if (fd == -1) {
        perror (filename);
        return;
      }

      while ((nread = read (fd, buf, sizeof (buf))) > 0)
        write (1, buf, nread);

      close (fd);
    }

    int
    main (int argc, char **argv)
    {
      int argno;
      for (argno = 1; argno < argc; argno++)
        typefile (argv[argno]);
      exit (0);
    }
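
The description of write above notes that it may write fewer bytes than requested, which typefile does not check for. The following is a minimal sketch of one way to handle that, not part of the original type.c. The names write_all and copy_fd are hypothetical helpers introduced here, and retrying on EINTR is an added assumption about how interrupted calls should be treated.

```c
#include <errno.h>
#include <unistd.h>

/* Write all n bytes of buf to fd, calling write again after a short
 * write.  Returns 0 on success, -1 on error. */
int
write_all (int fd, const void *buf, size_t n)
{
  const char *p = buf;

  while (n > 0) {
    ssize_t w = write (fd, p, n);
    if (w < 0) {
      if (errno == EINTR)
        continue;               /* interrupted by a signal: retry */
      return -1;
    }
    if (w == 0)
      return -1;                /* no progress: give up rather than spin */
    p += w;
    n -= (size_t) w;
  }
  return 0;
}

/* Copy everything readable from in to out.  Returns 0 once read
 * reports end of file, -1 on a read or write error. */
int
copy_fd (int in, int out)
{
  char buf[1024];
  ssize_t nread;

  while ((nread = read (in, buf, sizeof (buf))) > 0)
    if (write_all (out, buf, (size_t) nread) < 0)
      return -1;
  return nread < 0 ? -1 : 0;
}
```

With these helpers, the read/write loop in typefile could be replaced by a call such as copy_fd (fd, 1).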

2 TCP/IP Connections

2.1 Introduction

TCP is the reliable protocol many applications use to communicate over the Internet. TCP provides a stream abstraction: Two processes, possibly on different machines, each have a file descriptor. Data written to either descriptor will be returned by a read from the other.
Such network file descriptors are called sockets in Unix.

Every machine on the Internet has a unique, 32-bit IP (Internet protocol) address. An IP address is sufficient to route network packets to a machine from anywhere on the Internet. However, since multiple applications can use TCP simultaneously on the same machine, another level of addressing is needed to disambiguate which process and file descriptor incoming TCP packets correspond to. For this reason, each end of a TCP connection is named by a 16-bit port number in addition to its 32-bit IP address.

So how does a TCP connection get set up? Typically, a server will listen for connections on an IP address and port number. Clients can then allocate their own ports and connect to that server. Servers usually listen on well-known ports.
For instance, finger servers listen on port 79, web servers on port 80 and mail servers on port 25. A list of well-known port numbers can be found in the file /etc/services on any Unix machine.

The Unix telnet utility will allow you to connect to TCP servers and interact with them. By default, telnet connects to port 23 and speaks to a telnet daemon that runs login. However, you can specify a different port number. For instance, port 7 on many machines runs a TCP echo server:

    athena% telnet athena.dialup.mit.edu 7
    ...including Athena's default telnet options: "-ax"
    Trying 18.184.0.39...
    Connected to ten-thousand-dollar-bill.dialup.mit.edu.
    Escape character is '^]'.
    repeat after me...
    repeat after me...
    The echo server works!
    The echo server works!
    quit
    quit
    ^]
    telnet> q
    Connection closed.
    athena%

Note that in order to quit telnet, you must type Control-] followed by q and return. The echo server will happily echo anything you type like quit.

As another example, let's look at the finger protocol, one of the simplest widely used TCP protocols. The Unix finger command takes a single argument of the form user@host. It then connects to port 79 of host, writes the user string and a carriage-return line-feed over the connection, and dumps whatever the server sends back to the standard output.
We can simulate the finger command using telnet. For instance, using telnet to do the equivalent of the command finger help@mit.edu, we get:

    athena% telnet mit.edu 79
    ...including Athena's default telnet options: "-ax"
    Trying 18.72.0.100...
    Connected to mit.edu.
    Escape character is '^]'.
    help
    These help topics are available:
    about general options restrictions url change-info motd policy services wildcards
    To view one of these topics, enter "help name-of-topic-you-want".
    ...
    Connection closed by foreign host.
    athena%

2.2 TCP client programming

Now let's see how to make use of sockets in C. Section 2.3 shows the source code to a simple finger client that does the equivalent of the last telnet example of the previous section. The function tcpconnect shows all the steps necessary to connect to a TCP server. It makes the following system calls:

  • int socket (int domain, int type, int protocol);

    The socket system call creates a new socket, just as open creates a new file descriptor. socket returns a non-negative file descriptor number on success, or -1 on an error.

    When creating a TCP socket, domain should be AF_INET, signifying an IP socket, and type should be SOCK_STREAM, signifying a reliable stream. Since the reliable stream protocol for IP is TCP, the first two arguments already effectively specify TCP. Thus, the third argument can be left 0, letting the operating system assign a default protocol (which will be IPPROTO_TCP).

    Unlike file descriptors returned by open, you can't immediately read and write data to a socket returned by socket. You must first assign the socket a local IP address and port number, and in the case of TCP you need to connect the other end of the socket to a remote machine. The bind and connect system calls accomplish these tasks.

  • int bind (int s, struct sockaddr *addr, int addrlen);

    bind sets the local address and port number of a socket. s is the file descriptor number of a socket. For IP sockets, addr must be a structure of type sockaddr_in, usually as follows (in /usr/include/netinet/in.h):

        struct in_addr {
          u_int32_t s_addr;
        };

        struct sockaddr_in {
          short sin_family;
          u_short sin_port;
          struct in_addr sin_addr;
          char sin_zero[8];
        };

    addrlen must be the size of struct sockaddr_in (or whichever structure one is using).

    Different versions of Unix may have slightly different structures. However, all will have the fields sin_family, sin_port, and sin_addr. All other fields should be set to zero. Thus, before using a struct sockaddr_in, you must call bzero on it, as is done in tcpconnect. Once a struct sockaddr_in has been zeroed, the sin_family field must be set to the value AF_INET to indicate that this is indeed a sockaddr_in. (bind cannot take this for granted, as its argument is a more generic struct sockaddr *.)

    sin_port specifies which 16-bit port number to use. It is given in network (big-endian) byte order, and so must be converted from host to network byte order with htons. It is often the case when writing a TCP client that one wants a port number but doesn't care which one. Specifying a sin_port value of 0 tells the OS to choose the port number. The operating system will select an unused port number between 1,024 and 5,000 for the application. Note that only the super-user can bind port numbers under 1,024. Many system services such as mail servers listen for connections on well-known port numbers below 1,024. Allowing ordinary users to bind these ports would potentially also allow them to do things like intercept mail with their own rogue mail servers.

    sin_addr contains a 32-bit IP address for the local end of a socket. The special value INADDR_ANY tells the operating system to choose the IP address. This is usually what one wants when binding a socket, since code typically does not care about the IP address of the machine on which it is running.

  • int connect (int s, struct sockaddr *addr, int addrlen);

    connect specifies the address of the remote end of a socket. The arguments are the same as for bind, with the exception that one cannot specify a port number of 0 or an IP address of INADDR_ANY. connect returns 0 on success or -1 on failure.

    Note that one can call connect on a TCP socket without first calling bind. In that case, connect will assign the socket a local address as if the socket had been bound to port number 0 with address INADDR_ANY. The example finger calls bind for illustrative purposes only.

These three system calls create a connected TCP socket, over which the finger program writes the name of the user being fingered and reads the response. Most of the rest of the code should be straightforward, except you might wish to note the use of gethostbyname to translate a hostname into a 32-bit IP address.
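
To see what binding to port 0 with INADDR_ANY does in practice, the sketch below binds a socket that way and then asks the operating system which port it picked. The helper name bind_any_port is hypothetical, and getsockname is a standard sockets call that the text above does not cover; it is used here only to read back the chosen port, which comes back in network byte order and is converted with ntohs.

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Create a TCP socket bound to INADDR_ANY and a port chosen by the
 * operating system.  Stores the chosen port, in host byte order, in
 * *portp.  Returns the socket on success, -1 on error. */
int
bind_any_port (unsigned short *portp)
{
  struct sockaddr_in sa;
  socklen_t len = sizeof (sa);
  int s;

  s = socket (AF_INET, SOCK_STREAM, 0);
  if (s < 0)
    return -1;

  memset (&sa, 0, sizeof (sa));            /* zero all fields first */
  sa.sin_family = AF_INET;
  sa.sin_port = htons (0);                 /* OS chooses the port */
  sa.sin_addr.s_addr = htonl (INADDR_ANY); /* OS chooses the address */

  if (bind (s, (struct sockaddr *) &sa, sizeof (sa)) < 0
      || getsockname (s, (struct sockaddr *) &sa, &len) < 0) {
    close (s);
    return -1;
  }

  *portp = ntohs (sa.sin_port);            /* network to host order */
  return s;
}
```

A caller might print the port and then go on to call connect or listen on the returned socket; the range the port falls in depends on the operating system.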

2.3 myfinger.c: A simple network finger client

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <string.h>
    #include <netdb.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define FINGER_PORT 79
    #define bzero(ptr, size) memset (ptr, 0, size)

    /* Create a TCP connection to host and port.  Returns a file
     * descriptor on success, -1 on error. */
    int
    tcpconnect (char *host, int port)
    {
      struct hostent *h;
      struct sockaddr_in sa;
      int s;

      /* Get the address of the host at which to finger from the
       * hostname. */
      h = gethostbyname (host);
      if (!h || h->h_length != sizeof (struct in_addr)) {
        fprintf (stderr, "%s: no such host\n", host);
        return -1;
      }

      /* Create a TCP socket. */
      s = socket (AF_INET, SOCK_STREAM, 0);
      if (s < 0) {
        perror ("socket");
        return -1;
      }

      /* Use bind to set an address and port number for our end of the
       * finger TCP connection. */
      bzero (&sa, sizeof (sa));
      sa.sin_family = AF_INET;
      sa.sin_port = htons (0);                  /* tells OS to choose a port */
      sa.sin_addr.s_addr = htonl (INADDR_ANY);  /* tells OS to choose IP addr */
      if (bind (s, (struct sockaddr *) &sa, sizeof (sa)) < 0) {
        perror ("bind");
        close (s);
        return -1;
      }

      /* Now use h to set the destination address. */
      sa.sin_port = htons (port);
      sa.sin_addr = *(struct in_addr *) h->h_addr;

      /* And connect to the server */
      if (connect (s, (struct sockaddr *) &sa, sizeof (sa)) < 0) {
        perror (host);
        close (s);
        return -1;
      }

      return s;
    }

    int
    main (int argc, char **argv)
    {
      char *user;
      char *host;
      int s;
      int nread;
      char buf[1024];

      /* Get the name of the host at which to finger from the end of the
       * command line argument. */
      if (argc == 2) {
        user = malloc (1 + strlen (argv[1]));
        if (!user) {
          fprintf (stderr, "out of memory\n");
          exit (1);
        }
        strcpy (user, argv[1]);
        host = strrchr (user, '@');
      }
      else
        user = host = NULL;
      if (!host) {
        fprintf (stderr, "usage: %s user@host\n", argv[0]);
        exit (1);
      }
      *host++ = '\0';

      /* Try connecting to the host. */
      s = tcpconnect (host, FINGER_PORT);
      if (s < 0)
        exit (1);

      /* Send the username to finger */
      if (write (s, user, strlen (user)) < 0
          || write (s, "\r\n", 2) < 0) {
        perror (host);
        exit (1);
      }

      /* Now copy the result of the finger command to stdout. */
      while ((nread = read (s, buf, sizeof (buf))) > 0)
        write (1, buf, nread);

      exit (0);
    }

2.4 TCP server programming

Now let's look at what happens in a TCP server. Section 2.5 shows the complete source code to a simple finger server. It listens for clients on the finger port, 79. Then, for each connection established, it reads a line of data, interprets it as the name of a user to finger, and runs the local finger utility directing its output back over the socket to the client.

The function tcpserv takes a port number as an argument, binds a socket to that port, tells the kernel to listen for TCP connections on that socket, and returns the socket file descriptor number, or -1 on an error. This requires three main system calls:

  • int socket (int domain, int type, int protocol);

    This function creates a socket, as described in Section 2.2.

  • int bind (int s, struct sockaddr *addr, int addrlen);

    This function assigns an address to a socket, as described in Section 2.2. Unlike the finger client, which did not care about its local port number, here we specify a specific port number.

    Binding a specific port number can cause complications when killing and restarting servers (for instance during debugging). Closed TCP connections can sit for a while in a state called TIME_WAIT before disappearing entirely. This can prevent a restarted TCP server from binding the same port number again, even if the old process no longer exists. The setsockopt system call shown in tcpserv avoids this problem. It tells the operating system to let the socket be bound to a port number already in use.

  • int listen (int s, int backlog);

    listen tells the operating system to accept network connections. It returns 0 on success, and -1 on error. s is an unconnected socket bound to the port on which to accept connections. backlog formerly specified the number of connections the operating system would accept ahead of the application. That argument is ignored by most modern Unix operating systems, however. People traditionally use the value 5.

    Once you have called listen on a socket, you cannot call connect, read, or write, as the socket has no remote end. Instead, a new system call, accept, creates a new socket for each client connecting to the port s is bound to.

Once tcpserv has begun listening on a socket, main accepts connections from clients, with the system call accept.

  • int accept (int s, struct sockaddr *addr, int *addrlenp);

    accept takes a socket s on which one is listening and returns a new socket to which a client has just connected. If no clients have connected, accept will block until one does. accept returns -1 on an error.

    For TCP, addr should be a struct sockaddr_in *. addrlenp must be a pointer to an integer containing the value sizeof (struct sockaddr_in). accept will adjust *addrlenp to contain the actual length of the struct sockaddr it copies into *addr. In the case of TCP, all struct sockaddr_in's are the same size, so *addrlenp shouldn't change.

The finger daemon makes use of a few more Unix system calls which, while not network-specific, are often encountered in network servers. With fork it creates a new process. This new process calls dup2 to redirect its standard output and error over the accepted socket. Finally, a call to execl replaces the new process with an instance of the finger program. finger inherits its standard output and error, so these go straight back over the network to the client.

  • int fork (void);

    fork creates a new process, identical to the current one. In the old process, fork returns the process ID of the new process. In the new or "child" process, fork returns 0. fork returns -1 if there is an error.

  • int dup2 (int oldfd, int newfd);

    dup2 closes file descriptor number newfd, and replaces it with a copy of oldfd. When the second argument is 1, this changes the destination of the standard output. When that argument is 2, it changes the standard error.

  • int execl (char *path, char *arg0, ..., NULL);

    The execl system call runs a command, as if you had typed it on the command line. The command executed will inherit all file descriptors except those with the close-on-exec flag set. execl replaces the currently executing program with the one specified by path. On success, it therefore doesn't return. On failure, it returns -1.

2.5 myfingerd.c: A simple network finger server

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <string.h>
    #include <netdb.h>
    #include <signal.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define FINGER_PORT 79
    #define FINGER_COMMAND "/usr/bin/finger"
    #define bzero(ptr, size) memset (ptr, 0, size)

    /* Create a TCP socket, bind it to a particular port, and call listen
     * for connections on it.  These are the three steps necessary before
     * clients can connect to a server. */
    int
    tcpserv (int port)
    {
      int s, n;
      struct sockaddr_in sin;

      /* The address of this server */
      bzero (&sin, sizeof (sin));
      sin.sin_family = AF_INET;
      sin.sin_port = htons (port);
      /* We are interested in listening on any and all IP addresses this
       * machine has, so use the magic IP address INADDR_ANY. */
      sin.sin_addr.s_addr = htonl (INADDR_ANY);

      s = socket (AF_INET, SOCK_STREAM, 0);
      if (s < 0) {
        perror ("socket");
        return -1;
      }

      /* Allow the program to run again even if there are old connections
       * in TIME_WAIT.  This is the magic you need to do to avoid seeing
       * "Address already in use" errors when you are killing and
       * restarting the daemon frequently. */
      n = 1;
      if (setsockopt (s, SOL_SOCKET, SO_REUSEADDR, (char *) &n, sizeof (n)) < 0) {
        perror ("SO_REUSEADDR");
        close (s);
        return -1;
      }

      /* This function sets the close-on-exec bit of a file descriptor.
       * That way no programs we execute will inherit the TCP server file
       * descriptor. */
      fcntl (s, F_SETFD, FD_CLOEXEC);

      if (bind (s, (struct sockaddr *) &sin, sizeof (sin)) < 0) {
        fprintf (stderr, "TCP port %d: %s\n", port, strerror (errno));
        close (s);
        return -1;
      }
      if (listen (s, 5) < 0) {
        perror ("listen");
        close (s);
        return -1;
      }

      return s;
    }

    /* Read a line of input from a file descriptor and return it.  Returns
     * NULL on EOF/error/out of memory.  May over-read, so don't use this
     * if there is useful data after the first line. */
    static char *
    readline (int s)
    {
      char *buf = NULL, *nbuf;
      int buf_pos = 0, buf_len = 0;
      int i, n;

      for (;;) {
        /* Ensure there is room in the buffer */
        if (buf_pos == buf_len) {
          buf_len = buf_len ? buf_len << 1 : 4;
          nbuf = realloc (buf, buf_len);
          if (!nbuf) {
            free (buf);
            return NULL;
          }
          buf = nbuf;
        }

        /* Read some data into the buffer */
        n = read (s, buf + buf_pos, buf_len - buf_pos);
        if (n <= 0) {
          if (n < 0)
            perror ("read");
          else
            fprintf (stderr, "read: EOF\n");
          free (buf);
          return NULL;
        }

        /* Look for the end of a line, and return if we got it.  Be
         * generous in what we consider to be the end of a line. */
        for (i = buf_pos; i < buf_pos + n; i++)
          if (buf[i] == '\0' || buf[i] == '\r' || buf[i] == '\n') {
            buf[i] = '\0';
            return buf;
          }

        buf_pos += n;
      }
    }

    static void
    runfinger (int s)
    {
      char *user;

      /* Read the username being fingered.  readline returns NULL on
       * EOF or error, in which case there is nothing to finger. */
      user = readline (s);
      if (!user)
        exit (1);

      /* Now connect standard output and standard error to the socket,
       * instead of the invoking user's terminal. */
      if (dup2 (s, 1) < 0 || dup2 (s, 2) < 0) {
        perror ("dup2");
        exit (1);
      }
      close (s);

      /* Run the finger command.  It will inherit our standard output and
       * error, and therefore send its results back over the network. */
      execl (FINGER_COMMAND, "finger", "--",
             *user ? user : (char *) NULL, (char *) NULL);

      /* We should never get here, unless we couldn't run finger. */
      perror (FINGER_COMMAND);
      exit (1);
    }

    int
    main (int argc, char **argv)
    {
      int ss, cs;
      struct sockaddr_in sin;
      socklen_t sinlen;
      int pid;

      /* This system call allows one to call fork without worrying about
       * calling wait.  Don't worry about what it means unless you start
       * caring about the exit status of forked processes, in which case
       * you should delete this line and read the manual pages for wait
       * and waitpid.  For a description of what this signal call really
       * does, see the manual page for sigaction and look for
       * SA_NOCLDWAIT.  Signal is an older signal interface which when
       * invoked this way is equivalent to setting SA_NOCLDWAIT. */
      signal (SIGCHLD, SIG_IGN);

      ss = tcpserv (FINGER_PORT);
      if (ss < 0)
        exit (1);

      for (;;) {
        sinlen = sizeof (sin);
        cs = accept (ss, (struct sockaddr *) &sin, &sinlen);
        if (cs < 0) {
          perror ("accept");
          exit (1);
        }

        printf ("connection from %s\n", inet_ntoa (sin.sin_addr));

        pid = fork ();
        if (!pid)
          /* Child process */
          runfinger (cs);

        close (cs);
      }
    }
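
The fork, dup2, and execl pattern used by runfinger can be tried without a network by redirecting a child's standard output into a pipe instead of a socket. The sketch below is an illustration only: capture_output is a hypothetical helper, it assumes /bin/sh exists, and unlike myfingerd it waits for the child with waitpid rather than ignoring SIGCHLD.

```c
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Run "sh -c cmd" in a child whose standard output is a pipe, and read
 * up to len - 1 bytes of its output into buf (NUL-terminated).
 * Returns the number of bytes read, or -1 on error. */
ssize_t
capture_output (const char *cmd, char *buf, size_t len)
{
  int fds[2];
  pid_t pid;
  ssize_t n = 0, total = 0;

  if (len == 0 || pipe (fds) < 0)
    return -1;

  pid = fork ();
  if (pid < 0) {
    close (fds[0]);
    close (fds[1]);
    return -1;
  }
  if (pid == 0) {
    /* Child: make descriptor 1 a copy of the pipe's write end. */
    close (fds[0]);
    if (dup2 (fds[1], 1) < 0)
      _exit (1);
    close (fds[1]);
    execl ("/bin/sh", "sh", "-c", cmd, (char *) NULL);
    _exit (1);                  /* only reached if execl failed */
  }

  /* Parent: read until the buffer is full or the child's end closes. */
  close (fds[1]);
  while ((size_t) total < len - 1
         && (n = read (fds[0], buf + total, len - 1 - (size_t) total)) > 0)
    total += n;
  close (fds[0]);
  buf[total] = '\0';
  waitpid (pid, NULL, 0);
  return total;
}
```

In the finger server the same three calls appear with the accepted socket in place of the pipe, so the command's output travels back to the client.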

3 Non-blocking I/O

3.1 The O_NONBLOCK flag

The finger client in Section 2.3 is only as fast as the server to which it talks. When the program calls connect, read, and sometimes even write, it must wait for a response from the server before making any further progress. This doesn't ordinarily pose a problem; if finger blocks, the operating system will schedule another process so the CPU can still perform useful work.

On the other hand, suppose you want to finger some huge number of users. Some servers may take a long time to respond (for instance, connection attempts to unreachable servers will take over a minute to time out). Thus, your program itself may have plenty of useful work to do, and you may not want to schedule another process every time a server is slow to respond.

For this reason, Unix allows file descriptors to be placed in a non-blocking mode. A bit associated with each file descriptor, O_NONBLOCK, determines whether it is in non-blocking mode or not. Section 3.4 shows some utility functions for non-blocking I/O. The function make_async sets the O_NONBLOCK bit of a file descriptor with the fcntl system call, making it non-blocking.

Many system calls behave slightly differently on file descriptors which have O_NONBLOCK set:

  • read. When there is data to read, read behaves as usual. When there is an end of file, read still returns 0. If, however, a process calls read on a non-blocking file descriptor when there is no data to be read yet, instead of waiting for data, read will return -1 and set errno to EAGAIN.

  • write. Like read, write will return 0 on an end of file, and -1 with an errno of EAGAIN if there is no buffer space. If, however, there is some buffer space but not enough to contain the entire write request, write will take as much data as it can and return a value smaller than the length specified as its third argument. Code must handle such "short writes" by calling write again later on the rest of the data.

  • connect. A TCP connection request requires a response from the listening server. When called on a non-blocking socket, connect cannot wait for such a response before returning. For this reason, connect on a non-blocking socket usually returns -1 with errno set to EINPROGRESS. Occasionally, however, connect succeeds or fails immediately even on a non-blocking socket, so you must be prepared to handle this case.

  • accept. When there are connections to accept, accept will behave as usual. If there are no pending connections, however, accept will return -1 and set errno to EWOULDBLOCK. It's worth noting that file descriptors returned by accept have O_NONBLOCK clear, whether or not the listening socket is non-blocking. In asynchronous servers, one often sets O_NONBLOCK immediately on any file descriptors accept returns.
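
The read behaviour listed above can be observed on an ordinary pipe. The following is a minimal sketch under stated assumptions: set_nonblocking and try_read are hypothetical names (the make_async function of Section 3.4 is not reproduced here), and the -2 return value is a convention invented for this example to separate "no data yet" from a real error.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Set the O_NONBLOCK flag on fd, preserving its other status flags.
 * Returns 0 on success, -1 on error. */
int
set_nonblocking (int fd)
{
  int flags = fcntl (fd, F_GETFL);
  if (flags < 0)
    return -1;
  return fcntl (fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Try to read from a non-blocking descriptor.  Returns the byte count
 * (0 means end of file), -1 on a real error, or -2 if no data is
 * available yet and the caller should try again later. */
ssize_t
try_read (int fd, void *buf, size_t len)
{
  ssize_t n = read (fd, buf, len);
  if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
    return -2;
  return n;
}
```

Checking for both EAGAIN and EWOULDBLOCK is a defensive choice, since some systems define them as different values.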

3.2 select: Finding out when sockets are ready

O_NONBLOCK allows an application to keep the CPU when an I/O system call would ordinarily block. However, programs can use several non-blocking file descriptors and still find none of them ready for I/O. Under such circumstances, programs need a way to avoid wasting CPU time by repeatedly polling individual file descriptors. The select system call solves this problem by letting applications sleep until one or more file descriptors in a set is ready for I/O.

select usage

  • int select (int nfds, fd_set *rfds, fd_set *wfds, fd_set *efds, struct timeval *timeout);

    select takes pointers to sets of file descriptors and a timeout. It returns when one or more of the file descriptors are ready for I/O, or after the specified timeout. Before returning, select modifies the file descriptor sets so as to indicate which file descriptors actually are ready for I/O. select returns the number of ready file descriptors, or -1 on an error.

    select represents sets of file descriptors as bit vectors, one bit per descriptor. The first bit of a vector is 1 if that set contains file descriptor 0, the second bit is 1 if it contains descriptor 1, and so on. The argument nfds specifies the number of bits in each of the bit vectors being passed in. Equivalently, nfds is one more than the highest file descriptor number select must check on.

    These file descriptor sets are of type fd_set. Several macros in system header files allow easy manipulation of this type. If fd is an integer containing a file descriptor, and fds is a variable of type fd_set, the following macros can manipulate fds:

      - FD_ZERO (&fds);
        Clears all bits in fds.

      - FD_SET (fd, &fds);
        Sets the bit corresponding to file descriptor fd in fds.

      - FD_CLR (fd, &fds);
        Clears the bit corresponding to file descriptor fd in fds.

      - FD_ISSET (fd, &fds);
        Returns true if and only if the bit for file descriptor fd is set in fds.

    select takes three file descriptor sets as input. rfds specifies the set of file descriptors on which the process would like to perform a read or accept. wfds specifies the set of file descriptors on which the process would like to perform a write. efds is a set of file descriptors for which the process is interested in exceptional events such as the arrival of out of band data. In practice, people rarely use efds. Any of the fd_set * arguments to select can be NULL to indicate an empty set.
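
As a small illustration of the macros and the nfds argument, the sketch below waits for a single descriptor to become readable, with a timeout. The name wait_readable and the millisecond interface are assumptions made for this example; the timeout is passed through the standard struct timeval, which holds seconds and microseconds.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait up to msec milliseconds for fd to become readable.
 * Returns 1 if fd is readable, 0 on timeout, -1 on error. */
int
wait_readable (int fd, long msec)
{
  fd_set rfds;
  struct timeval tv;
  int n;

  if (fd < 0 || fd >= FD_SETSIZE)
    return -1;                  /* fd does not fit in an fd_set */

  FD_ZERO (&rfds);
  FD_SET (fd, &rfds);
  tv.tv_sec = msec / 1000;
  tv.tv_usec = (msec % 1000) * 1000;

  /* nfds is one more than the highest descriptor in any set. */
  n = select (fd + 1, &rfds, NULL, NULL, &tv);
  if (n < 0)
    return -1;
  return n > 0 && FD_ISSET (fd, &rfds) ? 1 : 0;
}
```

Because select overwrites the sets it is given, this helper rebuilds rfds on every call; code that checks many descriptors would typically keep a master copy and pass select a fresh copy each time.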

The argument timeout specifies the amount of time to wait for a file descriptor to become ready. It is a pointer to a structure of the following form:

    struct timeval {
      long tv_sec;   /* seconds */
      long tv_usec;  /* and microseconds */
    };

timeout can also be NULL, in which case select will wait indefinitely.

Tips and subtleties

File descriptor limits. Programmers using select may be tempted to write code capable of using arbitrarily many file descriptors. Be aware that the operating system limits the number of file descriptors a process can have. If you don't bound the number of descriptors your program uses, you must be prepared for system calls like socket and accept to fail with errors like EMFILE. By default, a modern Unix system typically limits processes to 64 file descriptors (though the setrlimit system call can sometimes raise that limit substantially). Don't count on using all 64 file descriptors, either. All processes inherit at least three file descriptors (standard input, output, and error), and some C library functions need to use file descriptors, too. It should be safe to assume you can use 56 file descriptors, though.

If you do raise the maximum number of file descriptors allowed to your process, there is another problem to be aware of.
The fd_set type defines a vector with FD_SETSIZE bits in it (typically 256). If your program uses more than FD_SETSIZE file descriptors, you must allocate more memory for each vector than an fd_set contains, and you can no longer use the FD_ZERO macro.

Using select with connect. After connecting a non-blocking socket, you might like to know when the connect has completed and whether it succeeded or failed. TCP servers can accept connections without writing to them (for instance, our finger server waited to read a username before sending anything back over the socket). Thus, selecting for readability will not necessarily notify you of a connect's completion; you must check for writability.

When select does indicate the writability of a non-blocking socket with a pending connect, how can you tell if that connect succeeded? The simplest way is to try writing some data to the file descriptor to see if the write succeeds. This approach has two small complications. First, writing to an unconnected socket does more than simply return an error code; it kills the current process with a SIGPIPE signal. Thus, any program that risks writing to an unconnected socket should tell the operating system that it wants to ignore SIGPIPE. The signal system call accomplishes this:

    signal (SIGPIPE, SIG_IGN);

The second complication is that you may not have any data to write to a socket, yet still wish to know if a non-blocking connect has succeeded. In that case, you can find out whether a socket is connected with the getpeername system call. getpeername takes the same argument types as accept, but expects a connected socket as its first argument. If getpeername returns 0 (meaning success), then you know the non-blocking connect has succeeded. If it returns -1, then the connect has failed.

Example

Section 3.4 shows a simple select-based dispatcher. There are three functions:
• void cb_add (int fd, int write, void (*fn)(void *), void *arg);
  Tells the dispatcher to call function fn with argument arg when file descriptor fd is ready for reading, if write is 0, or writing, otherwise.

• void cb_free (int fd, int write);
  Tells the dispatcher it should no longer call any function when file descriptor fd is ready for reading or writing (depending on the value of write).

• void cb_check (void);
  Waits until one or more of the registered file descriptors is ready, and makes any appropriate callbacks.

The function cb_add maintains two fd_set variables, rfds for descriptors with read callbacks, and wfds for ones with write callbacks. It also records the function calls it needs to make in two arrays of cb ("callback") structures.

cb_check calls select on the file descriptors in rfds and wfds. Since select overwrites the fd_set structures it gets, cb_check must first copy the sets it is checking. cb_check then loops through the sets of ready descriptors making any appropriate function calls.

3.3 async.h: Interface to async.c

    #ifndef _ASYNC_H_ /* Guard against multiple inclusion */
    #define _ASYNC_H_ 1

    /* Enable stress-tests. */
    /* #define SMALL_LIMITS 1 */

    #include <sys/types.h>

    #if __GNUC__ != 2
    /* The __attribute__ keyword helps make gcc -Wall more useful, but
     * doesn't apply to other C compilers.  You don't need to worry about
     * what __attribute__ does (though if you are curious you can consult
     * the gcc info pages). */
    #define __attribute__(x)
    #endif /* __GNUC__ != 2 */

    /* 1 + highest file descriptor number expected */
    #define FD_MAX 64

    /* The number of TCP connections we will use.  This can be no higher
     * than FD_MAX, but we reserve a few file descriptors because 0, 1,
     * and 2 are already in use as stdin, stdout, and stderr.  Moreover,
     * libc can make use of a few file descriptors for functions like
     * gethostbyname. */
    #define NCON_MAX (FD_MAX - 8)

    void fatal (const char *msg, ...)
         __attribute__ ((noreturn, format (printf, 1, 2)));
    void make_async (int);

    /* Malloc-like functions that don't fail. */
    void *xrealloc (void *, size_t);
    #define xmalloc(size) xrealloc (0, size)
    #define xfree(ptr) xrealloc (ptr, 0)
    #define bzero(ptr, size) memset (ptr, 0, size)

    void cb_add (int, int, void (*fn)(void *), void *arg);
    void cb_free (int, int);
    void cb_check (void);

    #endif /* !_ASYNC_H_ */

3.4 async.c: Handy routines for asynchronous I/O

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <stdarg.h>
    #include <fcntl.h>
    #include <string.h>
    #include <errno.h>
    #include <assert.h>
    #include <sys/select.h>   /* select and the fd_set macros */
    #include <sys/socket.h>

    #include "async.h"

    /* Callback to make when a file descriptor is ready */
    struct cb {
      void (*cb_fn) (void *);     /* Function to call */
      void *cb_arg;               /* Argument to pass function */
    };
    static struct cb rcb[FD_MAX], wcb[FD_MAX];  /* Per fd callbacks */
    static fd_set rfds, wfds;                   /* Bitmap of cb's in use */

    void
    cb_add (int fd, int write, void (*fn)(void *), void *arg)
    {
      struct cb *c;

      assert (fd >= 0 && fd < FD_MAX);
      c = &(write ? wcb : rcb)[fd];
      c->cb_fn = fn;
      c->cb_arg = arg;
      FD_SET (fd, write ? &wfds : &rfds);
    }

    void
    cb_free (int fd, int write)
    {
      assert (fd >= 0 && fd < FD_MAX);
      FD_CLR (fd, write ? &wfds : &rfds);
    }

    void
    cb_check (void)
    {
      fd_set trfds, twfds;
      int i, n;

      /* Call select.  Since the fd_sets are both input and output
       * arguments, we must copy rfds and wfds. */
      trfds = rfds;
      twfds = wfds;
      n = select (FD_MAX, &trfds, &twfds, NULL, NULL);
      if (n < 0)
        fatal ("select: %s\n", strerror (errno));

      /* Loop through and make callbacks for all ready file descriptors */
      for (i = 0; n && i < FD_MAX; i++) {
        if (FD_ISSET (i, &trfds)) {
          n--;
          /* Because any one of the callbacks we make might in turn call
           * cb_free on a higher numbered file descriptor, we want to make
           * sure each callback is wanted before we make it.  Hence check
           * rfds. */
          if (FD_ISSET (i, &rfds))
            rcb[i].cb_fn (rcb[i].cb_arg);
        }
        if (FD_ISSET (i, &twfds)) {
          n--;
          if (FD_ISSET (i, &wfds))
            wcb[i].cb_fn (wcb[i].cb_arg);
        }
      }
    }

    void
    make_async (int s)
    {
      int n;

      /* Make file descriptor nonblocking. */
      if ((n = fcntl (s, F_GETFL)) < 0
          || fcntl (s, F_SETFL, n | O_NONBLOCK) < 0)
        fatal ("O_NONBLOCK: %s\n", strerror (errno));

      /* You can pretty much ignore the rest of this function... */

      /* Many asynchronous programming errors occur only when slow peers
       * trigger short writes.  To simulate this during testing, we set
       * the buffer size on the socket to 4 bytes.  This will ensure that
       * each read and write operation works on at most 4 bytes--a good
       * stress test. */
    #if SMALL_LIMITS
    #if defined (SO_RCVBUF) && defined (SO_SNDBUF)
      /* Make sure this really is a stream socket (like TCP).  Code using
       * datagram sockets will simply fail miserably if it can never
       * transmit a packet larger than 4 bytes. */
      {
        socklen_t sn = sizeof (n);   /* getsockopt expects a socklen_t */
        if (getsockopt (s, SOL_SOCKET, SO_TYPE, (char *)&n, &sn) < 0
            || n != SOCK_STREAM)
          return;
      }

      n = 4;
      if (setsockopt (s, SOL_SOCKET, SO_RCVBUF, (void *)&n, sizeof (n)) < 0)
        return;
      if (setsockopt (s, SOL_SOCKET, SO_SNDBUF, (void *)&n, sizeof (n)) < 0)
        fatal ("SO_SNDBUF: %s\n", strerror (errno));
    #else /* !SO_RCVBUF || !SO_SNDBUF */
    #error "Need SO_RCVBUF/SO_SNDBUF for SMALL_LIMITS"
    #endif /* SO_RCVBUF && SO_SNDBUF */
    #endif /* SMALL_LIMITS */

      /* Enable keepalives to make sockets time out if servers go away. */
      n = 1;
      if (setsockopt (s, SOL_SOCKET, SO_KEEPALIVE, (void *) &n, sizeof (n)) < 0)
        fatal ("SO_KEEPALIVE: %s\n", strerror (errno));
    }

    void *
    xrealloc (void *p, size_t size)
    {
      /* realloc (p, 0) is not portable, so handle the xfree case
       * explicitly. */
      if (!size) {
        free (p);
        return NULL;
      }
      p = realloc (p, size);
      if (!p)
        fatal ("out of memory\n");
      return p;
    }

    void
    fatal (const char *msg, ...)
    {
      va_list ap;

      fprintf (stderr, "fatal: ");
      va_start (ap, msg);
      vfprintf (stderr, msg, ap);
      va_end (ap);
      exit (1);
    }

3.5 Putting it all together

We now present an example that demonstrates the power of non-blocking socket I/O. Section 3.6 shows the source code to multifinger, an asynchronous finger client. When fingering many hosts, multifinger performs an order of magnitude better than a traditional Unix finger client. It connects to NCON_MAX hosts in parallel using non-blocking I/O. Some simple testing showed this client could finger 1,000 hosts in under 2 minutes.

3.6 multifinger.c: A mostly[3] asynchronous finger client

[3] gethostbyname performs synchronous socket I/O.

    #include <stdio.h>
    #include <stdlib.h>   /* exit */
    #include <unistd.h>
    #include <string.h>
    #include <errno.h>
    #include <netdb.h>
    #include <signal.h>
    #include <netinet/in.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    #include "async.h"

    #define FINGER_PORT 79
    #define MAX_RESP_SIZE 16384

    struct fcon {
      int fd;
      char *host;     /* Host to which we are connecting */
      char *user;     /* User to finger on that host */
      int user_len;   /* Length of the user string */
      int user_pos;   /* Number of bytes of user already written to network */
      void *resp;     /* Finger response read from network */
      int resp_len;   /* Number of allocated bytes resp points to */
      int resp_pos;   /* Number of resp bytes used so far */
    };

    int ncon;         /* Number of open TCP connections */

    static void
    fcon_free (struct fcon *fc)
    {
      if (fc->fd >= 0) {
        cb_free (fc->fd, 0);
        cb_free (fc->fd, 1);
        close (fc->fd);
        ncon--;
      }
      xfree (fc->host);
      xfree (fc->user);
      xfree (fc->resp);
      xfree (fc);
    }

    void
    finger_done (struct fcon *fc)
    {
      printf ("[%s]\n", fc->host);
      fwrite (fc->resp, 1, fc->resp_pos, stdout);
      fcon_free (fc);
    }

    static void
    finger_getresp (void *_fc)
    {
      struct fcon *fc = _fc;
      int n;

      /* Grow the response buffer (doubling it) whenever it is full. */
      if (fc->resp_pos == fc->resp_len) {
        fc->resp_len = fc->resp_len ? fc->resp_len << 1 : 512;
        if (fc->resp_len > MAX_RESP_SIZE) {
          fprintf (stderr, "%s: response too large\n", fc->host);
          fcon_free (fc);
          return;
        }
        fc->resp = xrealloc (fc->resp, fc->resp_len);
      }

      /* Cast to char * because arithmetic on void * is not standard C. */
      n = read (fc->fd, (char *) fc->resp + fc->resp_pos,
                fc->resp_len - fc->resp_pos);
      if (n == 0) {
        /* EOF: the response is complete.  finger_done frees fc, so
         * return now instead of touching it again. */
        finger_done (fc);
        return;
      }
      else if (n < 0) {
        if (errno == EAGAIN)
          return;
        else
          perror (fc->host);
        fcon_free (fc);
        return;
      }
      fc->resp_pos += n;
    }

    static void
    finger_senduser (void *_fc)
    {
      struct fcon *fc = _fc;
      int n;

      n = write (fc->fd, fc->user + fc->user_pos, fc->user_len - fc->user_pos);
      if (n <= 0) {
        if (n == 0)
          fprintf (stderr, "%s: EOF\n", fc->host);
        else if (errno == EAGAIN)
          return;
        else
          perror (fc->host);
        fcon_free (fc);
        return;
      }
      fc->user_pos += n;
      /* Once the whole request is written, switch to reading the reply. */
      if (fc->user_pos == fc->user_len) {
        cb_free (fc->fd, 1);
        cb_add (fc->fd, 0, finger_getresp, fc);
      }
    }

    static void
    finger (char *arg)
    {
      struct fcon *fc;
      char *p;
      struct hostent *h;
      struct sockaddr_in sin;

      p = strrchr (arg, '@');
      if (!p) {
        fprintf (stderr, "%s: ignored - not of form 'user@host'\n", arg);
        return;
      }

      fc = xmalloc (sizeof (*fc));
      bzero (fc, sizeof (*fc));

      fc->fd = -1;
      /* strlen (p) counts the '@', which leaves room for the final NUL. */
      fc->host = xmalloc (strlen (p));
      strcpy (fc->host, p + 1);
      /* The request is the user name followed by "\r\n". */
      fc->user_len = p - arg + 2;
      fc->user = xmalloc (fc->user_len + 1);
      memcpy (fc->user, arg, fc->user_len - 2);
      memcpy (fc->user + fc->user_len - 2, "\r\n", 3);

      h = gethostbyname (fc->host);
      if (!h) {
        fprintf (stderr, "%s: hostname lookup failed\n", fc->host);
        fcon_free (fc);
        return;
      }

      fc->fd = socket (AF_INET, SOCK_STREAM, 0);
      if (fc->fd < 0)
        fatal ("socket: %s\n", strerror (errno));
      ncon++;
      make_async (fc->fd);

      bzero (&sin, sizeof (sin));
      sin.sin_family = AF_INET;
      sin.sin_port = htons (FINGER_PORT);
      sin.sin_addr = *(struct in_addr *) h->h_addr;
      if (connect (fc->fd, (struct sockaddr *) &sin, sizeof (sin)) < 0
          && errno != EINPROGRESS) {
        perror (fc->host);
        fcon_free (fc);
        return;
      }

      cb_add (fc->fd, 1, finger_senduser, fc);
    }

    int
    main (int argc, char **argv)
    {
      int argno;

      /* Writing to an unconnected socket will cause a process to receive
       * a SIGPIPE signal.  We don't want to die if this happens, so we
       * ignore SIGPIPE. */
      signal (SIGPIPE, SIG_IGN);

      /* Fire off a finger request for every argument, but don't let the
       * number of outstanding connections exceed NCON_MAX. */
      for (argno = 1; argno < argc; argno++) {
        while (ncon >= NCON_MAX)
          cb_check ();
        finger (argv[argno]);
      }

      while (ncon > 0)
        cb_check ();
      exit (0);
    }

4 Finding out more

This document outlines the system calls needed to perform network I/O on UNIX systems. You may find that you wish to know more about the workings of these calls when you are programming. Fortunately these system calls are documented in detail in the Unix manual pages, which you can access via the man command. Section 2 of the manual corresponds to system calls. To look up the manual page for a system call such as socket, you can simply execute the command "man socket". Unfortunately, some system calls such as write have names that conflict with Unix commands. To see the manual page for write, you must explicitly specify section two of the manual, which you can do with "man 2 write" on BSD machines or "man -s 2 write" on System V. If you are unsure in which section of the manual to look for a command, you can run "whatis write" to see a list of sections in which it appears.
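
Returning to the discussion of using select with connect: the getpeername test described there can be wrapped in a small helper. The following is only a minimal sketch. The name connect_succeeded is hypothetical and does not appear in async.c or multifinger.c, and the sketch assumes you have already waited for select to report the socket writable.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of async.c): returns 1 if socket fd
 * has a peer, meaning its connect completed successfully, and 0 if
 * getpeername fails (for example with ENOTCONN). */
int
connect_succeeded (int fd)
{
  struct sockaddr_storage ss;   /* large enough for any address family */
  socklen_t len = sizeof (ss);

  assert (fd >= 0);
  return getpeername (fd, (struct sockaddr *) &ss, &len) == 0;
}
```

A write callback such as finger_senduser could call a helper like this before writing anything, and free the connection right away if it returns 0.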
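
The copy-before-select step that cb_check performs can also be looked at in isolation. The sketch below is an illustration rather than code from async.c. poll_readable is a hypothetical helper, and unlike cb_check (which blocks by passing a NULL timeout) it polls with a zero timeout. It hands select a copy of the caller's set, so the original set stays intact for the next call.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: report which descriptors in *watched (all of
 * them below nfds) are readable right now.  select overwrites the set
 * it is given, so we pass it a copy and leave *watched untouched.
 * Returns the number of ready descriptors, or -1 on error. */
int
poll_readable (const fd_set *watched, int nfds, fd_set *ready)
{
  struct timeval tv = { 0, 0 };   /* zero timeout: poll, don't block */

  assert (nfds >= 0 && nfds <= FD_SETSIZE);
  *ready = *watched;              /* copy; select modifies its argument */
  return select (nfds, ready, NULL, NULL, &tv);
}
```

After the call, FD_ISSET on ready tells you which descriptors can be read without blocking, while watched can be passed to the next call unchanged.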