#
#    userver -- (pronounced you-server or micro-server).
#    This file is part of the userver, a high-performance web server designed for
#    performance experiments.
#          
#    This file is Copyright (C) 2004-2010  Tim Brecht
#
#    Authors: Tim Brecht <brecht@cs.uwaterloo.ca>
#    See AUTHORS file for list of contributors to the project.
#  
#    This program is free software; you can redistribute it and/or
#    modify it under the terms of the GNU General Public License as
#    published by the Free Software Foundation; either version 2 of the
#    License, or (at your option) any later version.
#  
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
#    General Public License for more details.
#  
#    You should have received a copy of the GNU General Public License
#    along with this program; if not, write to the Free Software
#    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
#    02111-1307 USA
#

----------------------------------------------------------------------
AIO CONVERSION

AIO reads need buffers that the data can be read into.
I think for this we should try to make use of either the
buf field of the req structure or the uri field.


Some of the calls, such as aio_sock_socket, aio_sock_bind, and aio_sock_listen,
might be better renamed alt_socket, alt_bind, and alt_listen, since these
are not asynchronous calls but rather calls to interfaces that
provide access to APIs of various types that implement AIO.
This might be done with something like a --alt-socket flag.

We are attempting to convert from non-blocking to AIO calls by
adding aio versions of the non-blocking calls.
The difference is that the aio calls return immediately with
a different value.
The portion of the code that follows each call has been split off so that it
can be called both after a non-blocking call and after notification
of an aio completion.
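
A minimal sketch of this split, assuming hypothetical names throughout
(split_conn_t, finish_read, READ_PENDING and the do_read_* functions are
illustrations, not the userver's actual identifiers). The point is only
that the "after the call" code lives in one function that both paths reach.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-connection record; not the userver's req structure. */
typedef struct {
  int sd;
  char buf[256];
  ssize_t nread;     /* bytes delivered by the last completed read */
  int got_request;   /* set by the shared completion code */
} split_conn_t;

#define READ_PENDING (-2)  /* assumed "initiated, not finished" value */

/* Shared portion: runs once the data is in conn->buf, no matter
 * how it got there. */
void finish_read(split_conn_t *conn, ssize_t nbytes)
{
  conn->nread = nbytes;
  conn->got_request = (nbytes > 0);
}

/* Non-blocking path: the call itself produces the data, so the
 * shared portion runs right away. */
ssize_t do_read_nonblocking(split_conn_t *conn)
{
  ssize_t n = read(conn->sd, conn->buf, sizeof(conn->buf));
  if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    return READ_PENDING;   /* try again when the sd is readable */
  }
  finish_read(conn, n);
  return n;
}

/* AIO path: initiation returns immediately with a different value ... */
ssize_t do_read_aio(split_conn_t *conn)
{
  (void) conn;             /* a real version would post the read here */
  return READ_PENDING;
}

/* ... and the shared portion runs when the completion is delivered. */
void on_read_completion(split_conn_t *conn, ssize_t result)
{
  finish_read(conn, result);
}
```

Both do_read_nonblocking() and on_read_completion() end up in
finish_read(), so request handling is written only once.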

Q: With aio accepts, should we decrement num_idle when the aio_accept
   is initiated or when it is completed?
   Currently it seems that one would want to do this when they are initiated
   under the assumption that most of them would complete successfully.
   The alternative leaves us without a good idea of when to stop accepting
   new connections since by the time num_idle would get low (or to zero)
   we may have already initiated other aio_accepts - which we would
   subsequently like not to accept.
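
A minimal sketch of the "decrement on initiation" choice. Only num_idle
comes from the notes above; the struct, the outstanding counter and both
function names are hypothetical. A slot is reserved when the accept is
initiated and handed back only if the accept later fails.

```c
/* Hypothetical bookkeeping for accepts that are in flight. */
typedef struct {
  int num_idle;      /* connection slots not yet promised to anyone */
  int outstanding;   /* aio accepts initiated but not yet completed */
} accept_acct_t;

/* Returns 1 if an accept may be initiated (and reserves a slot), else 0. */
int accept_initiated(accept_acct_t *a)
{
  if (a->num_idle <= 0) {
    return 0;        /* nothing idle: stop initiating accepts */
  }
  a->num_idle--;
  a->outstanding++;
  return 1;
}

/* Called for each accept completion; a failed accept returns its slot. */
void accept_completed(accept_acct_t *a, int succeeded)
{
  a->outstanding--;
  if (!succeeded) {
    a->num_idle++;
  }
}
```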


----------------------------------------------------------------------
Notes regarding Socket / aio_interface / API Issues for the userver.

(Note - the server side doesn't use connect as the client side would
so one may want to add connect to this list).

Socket calls that are used by the userver
  socket, bind, 
   To get a socket and bind it to an addr and port.
  setsockopt, ioctl, fcntl
   To modify send and receive socket buffer sizes,
   to modify TCP options (e.g., NAGLE/TCP_NODELAY and SO_REUSEADDR).
  
  listen, accept,
   Specify backlog length for listen and then accept new connections.
  read, write, writev, sendfile, 
   Read request, send response (probably don't need sendfile)
   writev is handy to send header + content.
  close
   Finished with socket.

  select, poll, epoll, etc:
   Instead of finding out which calls could be made on fds that would
   not block one would call aio_suspend - to determine which of the
   aio calls that have been initiated have completed.
   Note that one might have to call aio_error() and/or aio_return() to
   determine which calls have completed.

   If aio_suspend really doesn't provide any info regarding which events
   have completed then we'll probably want to try or at least alternately
   have access to the Linux io_getevents() interface.

  I think that aio versions of accept, read, (one or more of 
  write, writev, sendfile), and possibly close
  would be needed.
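
As an illustration of the note that writev is handy for sending
header + content, here is a minimal sketch using the standard writev()
call. The function name send_header_and_body is hypothetical, and a real
server would also have to deal with short writes on a non-blocking socket.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send a reply header and body with a single writev() call.
 * Returns the number of bytes written or -1 on error; a short count
 * means the caller has to send the remainder later. */
ssize_t send_header_and_body(int sd, const char *header,
                             const char *body, size_t body_len)
{
  struct iovec iov[2];

  iov[0].iov_base = (void *) header;
  iov[0].iov_len = strlen(header);
  iov[1].iov_base = (void *) body;
  iov[1].iov_len = body_len;

  return writev(sd, iov, 2);
}
```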

File system calls that are used by the userver
  open, fstat (for content size for header), read, fstat
  mmap, mlock, madvise, munmap
  
  Would be good to have aio versions of all of the above except
  for perhaps mmap and friends.
  I believe that Linux probably supports an AIO version of read.

May want (but probably won't need) to mix aio calls
to sockets with aio calls to the file system.

----------------------------------------------------------------------
AIO Notes
Fri Jul 25 15:03:53 EDT 2003

This is based on an event driven FSM model for each connection.
The assumption is that one may use kernel supported aio for 
file system access and direct user socket ETA support for socket
access.

For now we introduce new states but it may be good to translate
these to and/or integrate them with existing states.
Start in state 0)

Note that there are a number of possibilities for accepting more
connections once the initial request has been handled.

0)   FSM_INITIAL_STATE
     If want to allow new connections to come in.
     aio_socket_accept()            transition to FSM_AIO_ACCEPTED_CONNECTION

1)   FSM_AIO_ACCEPTED_CONNECTION
     On completion 
       - get sd
       - aio_socket_read(sd)        transition to FSM_AIO_READING_REQUEST

2)   FSM_AIO_READING_REQUEST
2.5) FSM_AIO_READING_NEXT_REQUEST
     On completion 
     If data
     - parse request
     - check cache
       on miss aio_file_open(url)   transition to FSM_AIO_OPEN_FILE
       on hit  aio_socket_write(sd) transition to FSM_AIO_WRITING_REPLY
     If no data
       - aio_socket_close(sd)       transition to FSM_AIO_CLOSING

3)   FSM_AIO_OPEN_FILE
     On completion
       - get fd
       - aio_file_fstat(fd)         transition to FSM_AIO_FSTAT_FILE

4)   FSM_AIO_FSTAT_FILE
     On completion
       - use file size for content header (needed for SPEC)
       - aio_file_read(fd)          transition to FSM_AIO_READING_FILE
       - [this may be mmap instead]


5)   FSM_AIO_READING_FILE
     On completion
       - aio_socket_write(sd)       transition to FSM_AIO_WRITING_REPLY

6)   FSM_AIO_WRITING_REPLY
     On completion
       - aio_socket_read(sd)        transition to FSM_AIO_READING_NEXT_REQUEST

7)   FSM_AIO_CLOSING
     On completion                  transition to FSM_CLOSED

8)   FSM_CLOSED
     On completion 
       - check into accepting new connections?

----------------------------------------------------------------------
Dynamic Content Options

- Make use of Custom Ad code from CAD_u.c

Possible approaches:
 o try to fully integrate with userver
     - i.e., server calls the code directly
     - when writing to buffers do it in non-blocking mode and use select, etc.
     - con: ? hard to multiplex dynamic requests

 o talk to a backend server that handles them - like TUX or Apache
     - forward the request
     - wait for and read the response
     - write the response back to the client

 o create separate processes for this
     - communicate through a pipe or socket
     - this may look a lot like one above

Have different groups try different options.
----------------------------------------------------------------------
Using multiple userver processes 

- listen to some IP addresses on one socket using modified bind
- create one socket per NIC each bound to a different IP address.
   - need to then keep track of a list of server_sds
   - use fd set to track them
   - a fair bit of code currently uses server_sd.
      issues w.r.t. turning on and off connections

Compare N servers doing IP_ANY_ADDR with N servers doing some subset
   of addresses.

----------------------------------------------------------------------
Using multiple listening sockets

#define ONE_LISTENER
should be the common and fast case (i.e., ideally supporting multiple
listeners should not impact the ONE_LISTENER case).

Not clear that the SEND kernel stuff and/or the userver using the
SEND kernel stuff is set up to handle multiple listening sockets.


 NOTE: we expect that the number of listening sockets would be relatively
 small (e.g., one per NIC/IP addr).

Command line options to specify ip address and port to listen on.
  --ip-addr 192.168.10.105:6800 --ip-addr 192.168.20.105:6800 
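
A minimal sketch of parsing one --ip-addr argument of the form shown
above. The function name parse_ip_addr_opt is hypothetical and only IPv4
is handled; inet_pton() and strtol() are the standard library calls.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>

/* Parse "a.b.c.d:port" into an IPv4 address and a port.
 * Returns 0 on success, -1 on malformed input. */
int parse_ip_addr_opt(const char *arg, struct in_addr *addr, int *port)
{
  char ip[INET_ADDRSTRLEN];
  const char *colon = strchr(arg, ':');
  char *end;
  long p;
  size_t len;

  if (colon == NULL) {
    return -1;
  }
  len = (size_t) (colon - arg);
  if (len == 0 || len >= sizeof(ip)) {
    return -1;
  }
  memcpy(ip, arg, len);
  ip[len] = '\0';
  if (inet_pton(AF_INET, ip, addr) != 1) {
    return -1;
  }

  p = strtol(colon + 1, &end, 10);
  if (end == colon + 1 || *end != '\0' || p < 1 || p > 65535) {
    return -1;
  }
  *port = (int) p;
  return 0;
}
```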

Currently we get (accept) new connections via:
   do_new_connections() which calls sock_new_conn().
   Neither specifies a listening socket because there is currently
   only one socket that is listening (server_sd).

   If we move to multiple listening sockets I expect that
   we'll have some instances where we'll want to call accept
   with a specific socket (e.g., when an event tells
   us someone is trying to connect on that socket).

   In other cases we'll want to just try to accept new connections
   on any socket. In which case we'll probably want to try to 
   figure out a scheduling scheme for this.

   Maybe the calls are:

   do_new_connections(int sd, int called_from);
   sock_new_conn(int sd);

   Use negative values to specify:

   ANY_LISTENER       (MIN_INT)
   ALL_LISTENERS      (MIN_INT+1)
   N_LISTENERS        (any negative value that isn't 
                       ANY_LISTENER or ALL_LISTENERS)
   specific socket    (0 ... max_fd)

   In some cases we'll need to loop over multiple LISTENING sockets.

   if (sd == ANY_LISTENER) {
     sd = get_listener()
   }

   get_listener could round robin among the different listening sockets.

   Ideally we might have an interface to the OS that would help
   us to make a better decision (e.g., how many outstanding connection
   requests are there).
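
A minimal sketch of the round robin get_listener() idea. ANY_LISTENER
and ALL_LISTENERS are taken from the notes above (using INT_MIN from
limits.h for MIN_INT); the listeners array, add_listener() and
resolve_listener() are hypothetical helpers.

```c
#include <limits.h>

#define ANY_LISTENER  (INT_MIN)
#define ALL_LISTENERS (INT_MIN + 1)
#define MAX_LISTENERS 16   /* assumed small, e.g., one per NIC/IP addr */

int listeners[MAX_LISTENERS];
int num_listeners = 0;
int next_listener = 0;

/* Register a listening socket; returns 0, or -1 if the table is full. */
int add_listener(int sd)
{
  if (num_listeners >= MAX_LISTENERS) {
    return -1;
  }
  listeners[num_listeners++] = sd;
  return 0;
}

/* Round robin among the registered listening sockets. */
int get_listener(void)
{
  int sd;

  if (num_listeners == 0) {
    return -1;
  }
  sd = listeners[next_listener];
  next_listener = (next_listener + 1) % num_listeners;
  return sd;
}

/* Map ANY_LISTENER to a concrete socket; pass specific sockets through. */
int resolve_listener(int sd)
{
  if (sd == ANY_LISTENER) {
    return get_listener();
  }
  return sd;
}
```

ALL_LISTENERS and the N_LISTENERS range are left to the caller here,
since they imply a loop over several sockets rather than a single choice.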

userver.c and server_sock.c probably require the most work
especially server_init()

----------------------------------------------------------------------
The finite state machine in the userver looks like:

         Init
          |
          | accept started
          v
      Connecting
          |
          | accept done     
          |
          | read started
          v
    Reading Request --->----------->------------------+  read zero bytes
          |                                           |  close started
          | read done                                 |
          |                                           |
          |                                           |
          | write started                             |
          v                                           v
   Writing Reply <--------+                           |
     |    |               |                           |
     |    | write done    |      close started        |
     |    |  HTTP/1.0 ->--|-------->---------------+  |
     |    |               |                        |  |
     |    |               |                        |  |
     |    | read started  |                        |  |
     |    |               ^     read zero bytes    |  |
     v    v               |     close started      v  v    close done
  *RAWW*  |               |                        |  |
   (AIO)  |               |                        |  |
     |    |               |                        |  |
     v    v               |                        v  v
Reading Next Request -->--|-------->------------> Closing -----------> Closed
          |               |
          v read done     ^
          |               |
          | write started |
          +------->-------+
            
*RAWW* - Read Arrived While Writing
         Added a new state for AIO.
  o can only happen when preposting reads
  o happens when we are in a writing state but we got a read notification
  o when this happens we:
      o save the information for the read completion to handle it later
        (assume we can only get one of these per connection)
      o later when we get the write completion we handle the
        write completion and then handle the read completion
        using the information that we've saved
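
A minimal sketch of the RAWW handling described above. All type and
function names here are hypothetical; the sketch only shows the read
completion being saved while writing and replayed after the write
completion arrives.

```c
typedef enum {
  ST_WRITING,
  ST_READ_ARRIVED_WHILE_WRITING,
  ST_READING_NEXT_REQUEST
} raww_state_t;

typedef struct {
  raww_state_t state;
  int saved_result;      /* result of the early read completion */
  int saved_error;       /* error of the early read completion */
  int reads_handled;     /* how many read completions were processed */
  int last_read_result;  /* result passed to the last processed read */
} raww_conn_t;

/* Stand-in for the real read completion handling. */
void raww_complete_read(raww_conn_t *c, int result, int error)
{
  (void) error;
  c->reads_handled++;
  c->last_read_result = result;
}

/* Read completion: if we are still writing, save it for later.
 * Returns 0 if handled, -1 if it arrived in an unexpected state. */
int raww_on_read_compl(raww_conn_t *c, int result, int error)
{
  if (c->state == ST_WRITING) {
    c->saved_result = result;
    c->saved_error = error;
    c->state = ST_READ_ARRIVED_WHILE_WRITING;
    return 0;
  }
  if (c->state == ST_READING_NEXT_REQUEST) {
    raww_complete_read(c, result, error);
    return 0;
  }
  return -1;   /* only one saved read per connection is assumed */
}

/* Write completion: finish the write, then replay any saved read. */
int raww_on_write_compl(raww_conn_t *c)
{
  if (c->state == ST_READ_ARRIVED_WHILE_WRITING) {
    c->state = ST_READING_NEXT_REQUEST;
    raww_complete_read(c, c->saved_result, c->saved_error);
    return 0;
  }
  if (c->state == ST_WRITING) {
    c->state = ST_READING_NEXT_REQUEST;
    return 0;
  }
  return -1;
}
```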
----------------------------------------------------------------------
Notes on AIO with preposting.

If we don't prepost (reads)? we can only ever have one event completion
for any sd in any queue at any point in time.

Accepts are always preposted.
This can be done at various points.
  o Currently we prepost accepts when starting up the server.
  o When a connection is closed.

In addition to accepts we can only prepost reads.

Writes can not be preposted because we need to 
know what to reply. And this can only be determined
after the read completes. This forces writes to be
posted in order (i.e., not preposted).

Closes can not be preposted.
There are two cases:
  o HTTP 1.0   - must wait until the write completes before
                 calling close
  o HTTP 1.1   - must wait until the read request returns zero bytes
                 which indicates that the client has closed their
                 end of the connection

If we prepost accepts and reads,
reads should be preposted
  a) prior to accept calls, this is done with aio_sock_read_accept()
  b) prior to writes, this is probably best done right after getting
     a read completion (if it's not a 1.0 request).


switch (completion_type) {
  
  ACCEPT_COMPL:
    switch (current_fsm_state) {

      // Got ACCEPT_COMPL while
      CONNECTING:
        // this is in order
        complete_accept()

        if (not preposting reads) {
          do_read()
        }
        // state will transition to READING_REQUEST
        break;

      // Got ACCEPT_COMPL while
      READING_REQUEST:
        // accept arrives after a read completion
        // we've already assumed that the connection was accepted
        // so now just reap/ignore the accept completion
        break;

      // Got ACCEPT_COMPL while
      READING_NEXT_REQUEST:
        // shouldn't happen
        exit/assert;
        break;

      // Got ACCEPT_COMPL while
      READ_ARRIVED_WHILE_WRITING:
        // can't happen here.
        exit/assert;
        break;

      // Got ACCEPT_COMPL while
      WRITING:
        exit/assert;
        break;

      // Got ACCEPT_COMPL while
      CLOSING:
      CLOSED:
        // can't happen here.
        exit/assert;
        break;

      // Got ACCEPT_COMPL while
      default:
        // shouldn't happen
        exit/assert;
        break;

    }
    // End of ACCEPT_COMPL
    break;

  READ_COMPL:
    switch (current_fsm_state) {

      // Got READ_COMPL while
      CONNECTING:
        // can only happen if preposting reads
        complete_accept();
        // state will transition to READING_REQUEST

        complete_read();

        if (preposting reads) {
          do_read();
        }
        break;

      // Got READ_COMPL while
      READING_REQUEST:
      READING_NEXT_REQUEST:
        // this is in order
        complete_read();
        if (preposting reads) {
          do_read();
        }
        // now initiate the write
        do_write();
        // state will transition to WRITING
        break;

      // Got READ_COMPL while
      READ_ARRIVED_WHILE_WRITING:
        // can't happen
        // if we are in this state the only event we should be able
        // to get now is a write completion
        exit/assert;
        break;

      // Got READ_COMPL while
      WRITING:
        // Can only happen if we are preposting reads.
        // This is the case that causes us to transition to
        // the READ_ARRIVED_WHILE_WRITING state.
        saved_result = result;
        saved_error = error;
        // Later when the write completion arrives we'll
        // go ahead and process this read completion

        // state will transition to READ_ARRIVED_WHILE_WRITING
        break;

      // Got READ_COMPL while
      CLOSING:
      CLOSED:
        // shouldn't happen
        // this would be a close arriving before a read completion
        exit/assert;
        break;

      // Got READ_COMPL while
      default:
        // shouldn't happen
        exit/assert;
        break;
    }
    // End of READ_COMPL
    break;

  WRITE_COMPL:
    switch (current_fsm_state) {

      // Got WRITE_COMPL while
      CONNECTING:
      READING_REQUEST:
      READING_NEXT_REQUEST:
        // shouldn't happen
        // this would be a write arriving before 
        // an accept and/or read completion
        exit/assert;
        break;
        
      // Got WRITE_COMPL while
      READ_ARRIVED_WHILE_WRITING:
        // previously while we were in a writing state we got a read completion
        // (because a preposted read completed) we made note of that
        // by transitioning to this state and so now after we
        // complete the write we need to complete the read
        // that we got a notification for previously.

        complete_write();

        complete_read(saved_result, saved_error);

        // can assert that we are preposting reads
        // because otherwise we shouldn't be in this state
        if (preposting reads) {
          do_read();
        }

        // state will transition to READING_NEXT_REQUEST
        break;

      // Got WRITE_COMPL while
      WRITING:
        // this would be in order
        complete_write();

        if (not preposting reads) {
          do_read();
        }
        // state will transition to READING_NEXT_REQUEST
        break;

      // Got WRITE_COMPL while
      CLOSING:
      CLOSED:
        // shouldn't happen this would be a write arriving
        // after starting/finishing a close
        exit/assert;
        break;

      // Got WRITE_COMPL while
      default:
        // shouldn't happen
        exit/assert;
        break;
    }
    // End of WRITE_COMPL
    break;

  CLOSE_COMPL:
    switch (current_fsm_state) {
     
      // Got CLOSE_COMPL while
      CONNECTING:
      READING_REQUEST:
      READING_NEXT_REQUEST:
      WRITING:
      READ_ARRIVED_WHILE_WRITING:
        // shouldn't happen
        // this would be a close completion arriving before 
        // an accept and/or read and/or write completion
        exit/assert;
        break;
        
      // Got CLOSE_COMPL while
      CLOSING:
        // in order
        complete_close();
        // state will transition to CLOSED
        break;

      // Got CLOSE_COMPL while
      CLOSED:
        // shouldn't happen this would be a close completion arriving
        // after the connection has already been closed
        // (i.e., the close completion has already arrived).
        exit/assert;
        break;

      // Got CLOSE_COMPL while
      default:
        // shouldn't happen
        exit/assert;
        break;
    }
    // End of CLOSE_COMPL
    break;

  // Unknown/unhandled completion type
  default:
    // shouldn't happen 
    exit/assert;
    break;
}

----------------------------------------------------------------------
 o epoll-ET
   The userver has used an FSM for all of the approaches.
   The real difference is that now the userver needs to keep track
   of the current state of the socket/sd.

   With edge triggered epoll if
   the epoll_wait call tells us that the sd is writable
   we need to keep track of that for later.
   Once we read data from the socket we need to keep track of whether
   or not that socket is still readable or if we'll get another event
   to tell us it's readable.

   Similar things are done for writability.

   The socket is readable when epoll_wait tells us and remains readable until:
     o a read returns fewer bytes than requested
     o a read returns 0 bytes (indicating a close by the client)
     o a read returns EWOULDBLOCK
     o ?? what about other failure conditions??

     At this point the socket state (not the FSM) is not readable
     and the server will not try to read again until epoll_wait
     indicates the socket is readable.

   The socket is writable when epoll_wait tells us and remains writable until:
     o a write/writev/sendfile returns fewer bytes than requested
     o a write returns EWOULDBLOCK
     o ?? what about other failure conditions??
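
A minimal sketch of the readability rule above as a pure function. The
name still_readable is hypothetical. The open question about other
failure conditions is not answered here: the sketch simply assumes that
any error return marks the socket as not readable.

```c
#include <stddef.h>
#include <sys/types.h>

/* Returns 1 if the sd should still be treated as readable after a read
 * that asked for `requested` bytes and returned `rc`, else 0. */
int still_readable(ssize_t rc, size_t requested)
{
  if (rc < 0) {
    return 0;   /* EWOULDBLOCK; other errors treated the same (assumption) */
  }
  if (rc == 0) {
    return 0;   /* client closed its end */
  }
  if ((size_t) rc < requested) {
    return 0;   /* short read: the socket has been drained */
  }
  return 1;     /* full read: more data may be waiting */
}
```

The writable flag could be maintained the same way from the return
values of write/writev/sendfile, minus the zero-bytes case.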
----------------------------------------------------------------------
Notes regarding Nagle's algorithm, and TCP_CORK.

See Network Programming by Richard Stevens for info about Nagle's algorithm.

Nagle's algorithm is designed to reduce the number of small packets.
If a connection has outstanding data and it's currently waiting for an ACK
then no small packets will be sent until the existing data is ACK'ed.

Stevens points out that this algorithm often interacts with the delayed ACK
algorithm which waits some amount of time in the hope that it can piggyback
the ACK with some data.

The userver disables the Nagle algorithm on all sockets.
This is done using setsockopt by setting the TCP_NODELAY option.
This permits small replies to be sent immediately (i.e., without
waiting for outstanding ACKs from the clients).
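
A minimal sketch of disabling the Nagle algorithm on one socket. The
function name disable_nagle is hypothetical; the setsockopt call with
IPPROTO_TCP and TCP_NODELAY is the standard way to do this.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn on TCP_NODELAY (i.e., turn off Nagle) for sd.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int disable_nagle(int sd)
{
  int on = 1;

  return setsockopt(sd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```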

This causes a problem when using sendfile and other methods that require
us to send a reply using two separate system calls. I don't recall if
this is a problem when using writev.

On Linux the sendfile interface is not as general as it is on HP-UX or
Solaris.  On those systems one can provide pointers to data buffers as
well as files to be sent allowing one system call to be used to send both
the reply header and the file.  On Linux we are required to first write
the reply header (using write) and then send the file (using sendfile).

Having TCP_NODELAY turned on (i.e., the Nagle algorithm is disabled)
can result in the header being sent in a packet that is separate from
the file.  This isn't very nice for small files.

As a result a TCP_CORK option was added to Linux and is controlled through
the setsockopt call. When the options --use-sendfile and --use-tcp-cork
are both used the userver uses roughly the following sequence of calls.

   setsockopt()    // cork the socket holding back the packet until uncorked.
   write()         // write the reply header
   sendfile()      // send the file
   setsockopt()    // uncork the socket and the header and file are sent
                   // in the same packet.

See the code in do_sendfile for details.
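
A minimal, Linux-only sketch of the call sequence above. This is not the
code from do_sendfile; the function name send_reply_corked is hypothetical
and the sketch ignores short writes and EWOULDBLOCK, which a non-blocking
server would have to handle.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Cork the socket, write the header, sendfile the body, then uncork.
 * Returns 0 on success and -1 on any failure. */
int send_reply_corked(int sd, const char *hdr, size_t hdr_len,
                      int file_fd, size_t file_len)
{
  int on = 1;
  int off = 0;
  int rc = 0;
  off_t offset = 0;

  /* cork: hold back partial packets until uncorked */
  if (setsockopt(sd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on)) < 0) {
    return -1;
  }

  if (write(sd, hdr, hdr_len) != (ssize_t) hdr_len) {
    rc = -1;
  } else if (sendfile(sd, file_fd, &offset, file_len) < 0) {
    rc = -1;
  }

  /* uncork even on failure so the socket is not left corked */
  if (setsockopt(sd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off)) < 0) {
    rc = -1;
  }
  return rc;
}
```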
----------------------------------------------------------------------
Notes on using AIO parameters.

This hasn't been tried and/or tested yet but I believe that the accept
rate can be controlled using a combination of:

  --aio-accept-thold=N  
     Controls how many aio_sock_accept calls to have outstanding.
     This will always try to ensure that N calls have been 
     initiated/preposted for which we don't have an accept completion event.
    
  --accept-count=N  (same as -m N)
     Controls how many consecutive aio_sock_accept calls to make.
     If this is used as --accept-count=0, whenever the userver tries
     to initiate/prepost the accepting of new connections it will
     repeatedly call aio_sock_accept until --aio-accept-thold=N
     calls have been made for which we don't have completion events.
     So this will also interact with --aio-accept-thold.
     Also note that the userver also limits the total number of
     simultaneous connections to --max-conns=N (-c N).
     The userver needs to ensure that we never have more than
     the specified maximum number of connections so it must
     ensure that the total number of current connections plus
     the number of preposted accepts does not exceed --max-conns.

  --listenq=N
    This can be used to control the length of the listenq
    (i.e., the backlog parameter to the listen system call).
    Note that there are some early tests that will discard
    SYN packets if the kernel doesn't think that there will
    be room in the listenq/acceptq when the 3-way handshake completes.
    Also note that although Linux doesn't complain if you
    use a large value for N it simply changes the value to
    SOMAXCONN (128).

  --max-conns=N (-c N)
    Places a cap on the maximum number of simultaneous connections.
    NOTE: see the src/README file that comes with the userver
    for information about how to increase the FD_SETSIZE in order
    to use large values (i.e., larger than 1024). 
    Yes this is currently required even if select isn't being used.

  --max-fds=N   (-f N)
    Places a cap on the maximum value that any socket descriptor will have.
    NOTE: see the src/README file that comes with the userver
    for information about how to increase the FD_SETSIZE in order
    to use large values (i.e., larger than 1024).
    Yes this is currently required even if select isn't being used.
    Note that with caching on (--caching-on or -C) the maximum
    value for a socket descriptor can get very large.

  I think that high accept rates will be achieved by using a combination
  of these parameters.

  For example

  ./userver --max-conns=15000 --max-fds=32000 --aio-accept-thold=200
            --accept-count=0  --listenq=128 --caching-on
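
A minimal sketch of how --aio-accept-thold, --accept-count and
--max-conns might interact when deciding how many accepts to prepost.
The function name and its parameters are hypothetical, and like the
options themselves this reading of the interaction is untested.

```c
/* How many aio_sock_accept calls to initiate right now.
 *   thold        --aio-accept-thold (accepts to keep outstanding)
 *   accept_count --accept-count (0 means "up to the threshold")
 *   max_conns    --max-conns
 *   cur_conns    connections currently open
 *   outstanding  accepts preposted without a completion event yet
 */
int accepts_to_post(int thold, int accept_count, int max_conns,
                    int cur_conns, int outstanding)
{
  int want = thold - outstanding;                 /* refill to the threshold */
  int room = max_conns - cur_conns - outstanding; /* stay within max-conns */

  if (accept_count > 0 && want > accept_count) {
    want = accept_count;     /* limit consecutive calls */
  }
  if (want > room) {
    want = room;
  }
  return (want > 0) ? want : 0;
}
```

With the example command line above and no connections yet,
accepts_to_post(200, 0, 15000, 0, 0) asks for 200 accepts.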


Be aware that if any of the limits are exceeded the userver
will likely print a message and/or throw an assertion,
print out its current stats and exit.

This is especially true of many conditions that might be
exceeded with --caching-on (e.g., the hash table becomes too
full or we can't find a victim because every file in
the cache is currently being served/referenced by at least one
connection).
----------------------------------------------------------------------
SOME NOTES ON THE HIGH-LEVEL OPERATION OF THE USERVER WITH AIO

NOTE: that I haven't been able to test some
of the options and combinations of options.
Once we have a more stable system I'll debug
some of these options and look at their impact
on performance.

ALSO NOTE: there are lots of different ways of doing 
most of the things here.
I've taken the approach of trying to construct something that may be
reasonably flexible and then to look at how to tune
it when the system is stable enough to do so.


After initialization we enter the aio_loop.
We prepost a bunch of read/accepts (specified by -c option).
These are divided evenly among the different listening/accepting
sockets/nics.

We then check if there are events to process.
If there aren't any we go to sleep until
at least one event arrives to wake us up.

At this point we find out how many events of
each type are available. Yes this could be done 
differently which might be more efficient.
For now we favour knowing how many events of
each completion type we'll need to process
so that if we want/need to we can make informed
decisions about which events to process.

We currently process all completion events of each
type in the order specified (see completion_order).
  The current default order is write, read, accept completions.

We move through a finite-state-machine for each connection.
o When we are ready to accept a new connection we prepost
  the read and then the accept (at the same time)
o We prepost another read as soon as a read completion arrives.
o We post the write as soon as we can after knowing what the
  uri is (we can't do this until we actually have the read
  completion - because that contains the uri being
  requested).
o To support HTTP/1.1 we have to wait until a read
  completion returns 0 or an error like CONNRESET before
  closing/shutting down the connection.
o Currently new accepts are posted in batches
  after processing completion events of whatever type are
  posted. One thing we want to look at is does it make
  any difference to performance if we prepost a read/accept
  whenever a connection is closed. I believe that this
  may work by using the --accept-on-close option but
  this hasn't been tested.


-----------
aio_loop()
{

  while (1) {

    initiate_aio_accepts();

    n_write_compl = aio_sock_num_events(write_cq);
    n_total = n_write_compl;   /* reset the total on every pass */

    n_read_compl = aio_sock_num_events(read_cq);
    n_total += n_read_compl;

    n_accept_compl = aio_sock_num_events(accept_cq);
    n_total += n_accept_compl;

    /* if there aren't any completion events then wait */
    if (n_total == 0) {
      rc = aio_wait(ALL_QUEUES);
      /* on return get back up to the top of the loop 
         and find out how many of each type,
         these numbers can be used to schedule which
         completions to handle
       */ 
    } else {

      /* the completion order is programmable */
      for (i=0; i<completion_count; i++) {
        switch(completion_order[i]) {
          case SOCK_READ_COMPL:
            if (n_read_compl > 0) {
              handle_completions(SOCK_READ_COMPL, n_read_compl);
            }
            break;

          case SOCK_WRITE_COMPL:
            if (n_write_compl > 0) {
              handle_completions(SOCK_WRITE_COMPL, n_write_compl);
            }
            break;

          case SOCK_ACCEPT_COMPL:
            if (n_accept_compl > 0) {
              handle_completions(SOCK_ACCEPT_COMPL, n_accept_compl);
            }
            break;
        }
      }

    }
  }
}

--------

handle_completions(compl_t type, int n)
{
  /* by default we don't put any limits on
     how many events of each type to process.
     But limits can be set for each type on
     the command line e.g.,
     --aio-write-events-limit 200
     --aio-read-events-limit 100
     --aio-accept-events-limit 0 (i.e., no limit)
   */
  switch(type) {
    case SOCK_WRITE_COMPL:
      thold = options.aio_write_events_limit;
      queue = write_cq;
      break;

    case SOCK_READ_COMPL:
      thold = options.aio_read_events_limit;
      queue = read_cq;
      break;

    case SOCK_ACCEPT_COMPL:
      thold = options.aio_accept_events_limit;
      queue = accept_cq;
      break;

    default:
      /* don't index compl_type_str[] with a type we don't recognize */
      printf("handle_completions: unknown completion type = %d\n", type);
      exit(1);
      break;
  }

  /* get either the limit set above or as many
     events as could possibly fit into the completion
     array
   */

  if (thold) {
    max = thold;
  } else {
    max = completions_max;
  }

  /* get the events from the specified queue */
  rc = aio_sock_getevents(queue, max, completions);
  n = rc;

  /* process the completion events */
  process_aio_events(n, completions);
}

---------
Some example completion orders.
Default is currently set to OPT_AIO_GOOD_COMPLETION_ORDER.

Note that this currently doesn't support
orderings based on queue lengths but
it wouldn't be hard to support.

/* This order is helpful for testing out of order arrivals */
case OPT_AIO_TEST_COMPLETION_ORDER:
  completion_order[count++] = SOCK_READ_COMPL;
  completion_order[count++] = SOCK_WRITE_COMPL;
  completion_order[count++] = SOCK_ACCEPT_COMPL;
  break;

/* This is something we'd expect to be an ordering for good performance */
case OPT_AIO_GOOD_COMPLETION_ORDER:
  completion_order[count++] = SOCK_WRITE_COMPL;
  completion_order[count++] = SOCK_READ_COMPL;
  completion_order[count++] = SOCK_ACCEPT_COMPL;
  break;

/* This is a fairly safe completion order, usually being processed in order */
case OPT_AIO_SAFE_COMPLETION_ORDER:
  completion_order[count++] = SOCK_ACCEPT_COMPL;
  completion_order[count++] = SOCK_READ_COMPL;
  completion_order[count++] = SOCK_WRITE_COMPL;
  break;

/* This is an example completion order showing that we could do the
 * some completion event types more frequently than others
 */
case OPT_AIO_EXAMPLE_COMPLETION_ORDER:
  completion_order[count++] = SOCK_WRITE_COMPL;
  completion_order[count++] = SOCK_READ_COMPL;
  completion_order[count++] = SOCK_WRITE_COMPL;
  completion_order[count++] = SOCK_ACCEPT_COMPL;
  completion_order[count++] = SOCK_READ_COMPL;
  completion_order[count++] = SOCK_WRITE_COMPL;
  break;

----------------------------------------------------------------------
Some notes on how to communicate with the app servers without sockets
with communication through shared memory

Getting data/requests to app servers from the userver
 o app server does sigwait 
 o userver puts request in shared memory
 o userver sends signal to the appserver
 o app server wakes up and gets request from shared memory

 Note that communication/signalling of the buffer being
 ready here could be done in other ways (e.g., semaphores).
 The key here is that the userver signals in a non-blocking
 fashion and the app server makes a call that blocks
 until it is signaled.
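
 The request hand-off above can be sketched as follows. This is only an
 illustration under assumptions: it uses an anonymous shared mapping and
 fork() to stand in for the real shared memory segment and app server
 process, SIGUSR1 as the signal, and hypothetical names (struct
 shm_area, handoff_request()). The waitpid() at the end exists only so
 the demo can check the result; the real userver would not block there.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define REQ_MAX 256

/* Hypothetical layout of the shared region. */
struct shm_area {
  char request[REQ_MAX];  /* written by the userver */
  char seen[REQ_MAX];     /* what the app server read, for the demo */
};

/* Returns 0 on success and copies what the app server saw into seen. */
static int handoff_request(const char *req, char *seen, size_t seen_len)
{
  sigset_t set, old;
  struct shm_area *shm;
  pid_t pid;
  int sig, status, ok = -1;

  assert(seen_len > 0);
  shm = mmap(NULL, sizeof(*shm), PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (shm == MAP_FAILED) {
    return -1;
  }
  memset(shm, 0, sizeof(*shm));

  /* Block SIGUSR1 before fork() so that a signal sent before the app
   * server reaches sigwait() stays pending instead of being lost. */
  sigemptyset(&set);
  sigaddset(&set, SIGUSR1);
  sigprocmask(SIG_BLOCK, &set, &old);

  pid = fork();
  if (pid == 0) {
    /* App server: block until signaled, then read the request. */
    if (sigwait(&set, &sig) == 0 && sig == SIGUSR1) {
      strncpy(shm->seen, shm->request, REQ_MAX - 1);
    }
    _exit(0);
  }
  if (pid > 0) {
    /* userver: put the request in shared memory, then signal.
     * kill() returns without waiting for the app server. */
    strncpy(shm->request, req, REQ_MAX - 1);
    kill(pid, SIGUSR1);
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
      strncpy(seen, shm->seen, seen_len - 1);
      seen[seen_len - 1] = '\0';
      ok = 0;
    }
  }

  sigprocmask(SIG_SETMASK, &old, NULL);
  munmap(shm, sizeof(*shm));
  return ok;
}
```

 sigwait() requires the signal to be blocked in the waiting process,
 which is why the mask is set up before the app server starts waiting.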

Getting responses from app servers to the userver
 o app server writes result into shared memory
 o when the app server is done it sets an app server done bit in 
   shared memory indicating done
   it then checks if the userver is waiting for a signal (checks the
   userver waiting for signal bit in shared memory) 
   if it is waiting for a signal, it sends the signal
   lastly it calls sigwait to wait for a signal from the userver (as above)

 o userver polls on the app server done bit
     if the bit is set 
       sendfile the data out of the shared memory buffer
     else
       enable the signal and call select/poll/epoll
       set userver waiting for signal bit (indicates calling select)
       repoll the app server done bit (to handle race conditions)
       if the app server done bit is still not set
         call select

 o userver will wake up either because of signal delivery 
   (which means an app server has completed) or because
   of activity on the client sockets

   signal delivery will invoke the signal handler which figures
   out which app server is done, does the appropriate interest
   set manipulation, disables the signal, and returns

   Note that select/poll/epoll may get EINTR and need to restart.

   select returning will disable the signal and then
   handle the work to do from select

   if a signal arrives prior to disabling the signal
   we just handle the signal and then resume execution

   does the userver need to figure out if it got the signal before, during
   or after calling select/poll/epoll?
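
   The two-bit handshake above can be sketched like this. It is an
   illustration only: struct shm_ctl, app_server_finish() and
   userver_check() are hypothetical names, C11 atomics stand in for
   whatever the shared memory bits would really be, and the actual
   signal sending and select call are left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical control block living in shared memory. */
struct shm_ctl {
  atomic_int app_done;         /* "app server done" bit */
  atomic_int userver_waiting;  /* "userver waiting for signal" bit */
};

enum userver_action { SEND_RESPONSE, CALL_SELECT };

/* App server side: publish completion, then report whether the
 * userver asked to be signaled. Returns 1 if a signal should be sent. */
static int app_server_finish(struct shm_ctl *ctl)
{
  atomic_store(&ctl->app_done, 1);
  return atomic_load(&ctl->userver_waiting) != 0;
}

/* userver side: poll the done bit; if it is clear, set the waiting
 * bit and re-poll before deciding to block in select. */
static enum userver_action userver_check(struct shm_ctl *ctl)
{
  if (atomic_load(&ctl->app_done)) {
    return SEND_RESPONSE;
  }
  atomic_store(&ctl->userver_waiting, 1);
  /* Re-poll: the app server may have finished between the first
   * check and the store above, in which case it saw waiting == 0
   * and did not send a signal. */
  if (atomic_load(&ctl->app_done)) {
    atomic_store(&ctl->userver_waiting, 0);
    return SEND_RESPONSE;
  }
  return CALL_SELECT;
}
```

   With sequentially consistent stores and loads, either the app server
   sees the waiting bit and signals, or the re-poll sees the done bit,
   so a completion is not missed. What remains is the window between
   the re-poll and the select call itself, which is the question raised
   above. One standard way to close it is pselect(), which installs a
   signal mask for the duration of the wait atomically, so the signal
   can stay blocked everywhere else.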

----------------------------------------------------------------------
See fastcgi/NOTES for when/how we process dynamic/FCGI requests.
----------------------------------------------------------------------
SETTING UP AND RUNNING SPECWEB99 AND/OR SPECWEB99-LIKE EXPERIMENTS

This portion of the document describes how to run the SPECweb99 benchmarking
suite against a web server. It uses the userver as an example,
and focuses on the dynamic portion of SPECweb99.

SERVER-SIDE
===========

- Basic directory structure
  /home/brecht/userver-spec  (call this the $TARGET directory)
  /home/brecht/userver-spec/specweb99  (this will contain the file_set)
  /home/brecht/userver-spec/specweb99/file_set  (this contains dirs/files)

- Check out, compile, and install the userver on the server machine.

  For example in /home/brecht/userver-spec (the $TARGET directory).

- Check out, compile, and install the SPECweb99 tools (wafgen99, upfgen99, cadgen99)
  on the server machine.

  For example in /home/brecht/userver-spec (if you want to run specweb99).
  Note that this will also be where the appserver lives.


- Use the wafgen99 program to generate a SPECweb99 fileset. You will 
  need to decide how big a fileset is needed for your workload.

  For example in /home/brecht/userver-spec/specweb99 run 
  % wafgen99 -d 100
  
  This creates a directory file_set that contains 100 directories worth of files.
  NOTE: this only needs to be done once. A symbolic link can be put into the
  /home/brecht/userver-spec/specweb99 directory to a shared file_set directory.

- Use the cadgen99 program to generate a Custom.Ads file. 
  Here is an example command line:
	% cadgen99 -C . -e 100 -t 100 1,100
  The -C argument allows you to specify the destination directory that the
  Custom.Ads file will be output to. Ideally, this should be the $TARGET
  directory mentioned at the start of this section.
  NOTE: on recent systems the above command may dump core.
        Instead you may need to use:
	% cadgen99 -C . -e 100 -t 100 1 100

  This command line was taken from a URL generated by the SPECweb99 
  manager script. The important thing is that the script generates 
  a Custom.Ads file with 359 entries, as this is the range of ads
  requested by the SPECweb99 (or httperf) clients.

  This is run in the $TARGET directory.


- Use the upfgen99 program to generate a User.Personality file.
  Here is an example command line:  
	% upfgen99 -C . -n 1000 -t 100

  This command has three arguments. The -C argument simply specifies
  the directory that the User.Personality file will be written to.
  The -n 1000 argument specifies that the workload simulates 1000 users.
  This number should match the number of users specified when generating
  the httpspec99-*.log files for use with httperf.

  This is run in the $TARGET directory.
 
- Copy the following files to the web server's document root ($TARGET)
    - User.Personality
    - Custom.Ads
    - specweb99-fcgi.pl (the SPECweb99 app server from src/fastcgi/ of
      the userver directory)

- Make and install the fcgi support

  I'm not sure if this step is necessary:
    % cd userver/src/fastcgi/fcgi-2.4.0
    % ./configure
    % make
    % make install (as root)

  This step is necessary (call this step X below)
    % cd userver/src/fastcgi/fcgi-2.4.0/perl
    % perl Makefile.PL
    % make
    % make install (as root)

  You can test whether this has worked by running:
  % cd $TARGET
  % ./specweb99-fcgi.pl 9000

  If you get something like:
     Can't locate FCGI.pm in @INC (@INC contains: /etc/perl
     /usr/local/lib/perl/5.8.7 /usr/local/share/perl/5.8.7 /usr/lib/perl5
     /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8
     /usr/local/lib/site_perl .) at ./specweb99-fcgi.pl line 37.
     BEGIN failed--compilation aborted at ./specweb99-fcgi.pl line 37.

  it is because step X above has failed.

- In the web server's document root ($TARGET), create an empty post.log file
	% touch post.log    

  This step is usually performed in SPECweb99 by sending a "Reset" request
  to the appserver as its very first request. When running with trun/httperf
  we must do this step manually. Also, ensure that there is sufficient disk
  space for the post.log to grow. Long running experiments can generate large
  post.log files.

- Open the source to the appserver (specweb99-fcgi.pl). Ensure that the
  $topdir variable is set to $TARGET. 

  E.g., /home/brecht/userver-spec, because the paths in the generated
         log files already include specweb99/file_set.

- Run the userver, and ensure that it can serve a simple dynamic request.
  In order to serve dynamic requests, the userver must be given the --app
  option, which configures it to use one or more appservers.
  The --app option has the following format:

	--app=APP,PROTO,HOSTNAME:PORT,NUM

  For example:

	--app=specweb99-fcgi.pl,FASTCGI,localhost:9000,5 

  The --app option does not start the appservers. If no other options are
  supplied, then the appservers must be started manually. However, the
  --start-app-server option will cause the userver to start its own
  appservers. That option has the following format:

	--start-app-server=localhost:PORT,NUM=PATH,CPUAFFINITYMASK 

  For example:

	--app=specweb99-fcgi.pl,FASTCGI,localhost:9000,5 
	--start-app-server=localhost:9000,5="/home/brecht/userver-spec/specweb99-fcgi.pl",0x000d

  Please see the userver man page for more information on these command line options.
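
  To make the two option formats concrete, here is a small sketch that
  splits the option values into their fields with sscanf(). It is not
  the userver's own parser; the struct names, field sizes and function
  names are assumptions, and the fields are simply named after the
  placeholders in the formats above. The man page remains the
  authority on what each field means.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of --app=APP,PROTO,HOSTNAME:PORT,NUM (sizes are assumptions). */
struct app_opt {
  char app[64];
  char proto[16];
  char host[64];
  int port;
  int num;
};

/* Fields of --start-app-server=localhost:PORT,NUM=PATH,CPUAFFINITYMASK. */
struct start_opt {
  char host[64];
  int port;
  int num;
  char path[256];
  unsigned long cpu_mask;
};

/* Parse the text after "--app=". Returns 0 on success, -1 otherwise. */
static int parse_app_opt(const char *value, struct app_opt *o)
{
  assert(value != NULL && o != NULL);
  memset(o, 0, sizeof(*o));
  if (sscanf(value, "%63[^,],%15[^,],%63[^:]:%d,%d",
             o->app, o->proto, o->host, &o->port, &o->num) != 5) {
    return -1;
  }
  return 0;
}

/* Parse the text after "--start-app-server=". The shell has already
 * removed the double quotes around PATH by the time the program sees
 * the argument. Returns 0 on success, -1 otherwise. */
static int parse_start_opt(const char *value, struct start_opt *o)
{
  assert(value != NULL && o != NULL);
  memset(o, 0, sizeof(*o));
  if (sscanf(value, "%63[^:]:%d,%d=%255[^,],%lx",
             o->host, &o->port, &o->num, o->path, &o->cpu_mask) != 5) {
    return -1;
  }
  return 0;
}
```

  In the example above the HOSTNAME:PORT and NUM parts are the same in
  both options (localhost:9000 and 5), so a sketch like this could also
  be used to cross-check the two values against each other.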

