RT-11/TSX-Plus System Timing Jim Crapuchettes Omnex Corporation 2483 Old Middlefield Way Mountain View, CA. 94043 (415) 966-8400 Spring 1987 DECUS Symposium Nashville, TN (c) Copyright 1987, Omnex Corporation I. A little history-- In late 1985 we began work on our Q-bus Ethernet handler for the DEQNA, which later became NQ-Plus, a superset of the DEC NQ handler. A significant re- quirement in our design was support for what we call "split headers"; in add- ition to the data being sent, Ethernet frames contain at least a 48 bit source address, a 48 bit destination address and a 16 bit protocol type. Often, add- itional information is required. In the RT-11 NQ, this network-related infor- mation must be stored in memory immediately preceding the data itself; this often means that the data must be copied in memory (sometimes more than once) to assemble a complete Ethernet frame before it can be sent. So we wanted to be able to have a user separately specify the memory area where this informa- tion about the data is located. Since a major goal of this design was to maximize DEQNA performance, we wanted to be sure that the method we selected for specifying the header area did not, in fact, degrade performance. We had several possible ideas for specifying the area, but wanted to know which was best. I am a believer in measuring program timing when it can be done. We have a KWV11-C equivalent programmable real-time clock, so I wrote a program to use the clock to time a system call. Since we wanted to know how long it would take to call a handler with an I/O request, the program timed a very simple I/O request, a wait-mode read of 2 words from block 0 of the NL: handler. The clock was set up to run with a 1MHz timebase (1 microsecond per pulse) and the overflow interrupt was enabled to increment a high-order word on overflow. When the operation was completed the clock was stopped and the count of pulses from the 1 microsecond timebase was read. After each of 5 measurements the two words of time were converted to octal and decimal and output it to a file. As I began using this program, I noticed that the times varied more than I ex- pected. So I began increasing the number of trials (time measurements); from 5 to 10, then from 10 to 50, then from 50 to 100, then from 100 to 1,000, and finally from 1,000 to 10,000. While doing this, I also stopped worrying about 32-bit time values (16 bits can time to 65 milliseconds) and just stored the time in memory; when all timing trials were done, the times were converted to decimal and output to a file. Shortly after I began this experimentation I realized that the Line-Time Clock was the major source of some of the varia- tion I was seeing, but I left it enabled for all of the tests so that the time taken by the clock interrupt service routines could be identified in the data. RT-11/TSX-Plus System Timing, Spring 87 DECUS Page 2 As the number of measurements increased, we had to have a way to analyze the results. So I also wrote a program to read in the times, perform a frequency analysis on the values and then generate a histogram from the results. Since I was field testing the new RT-11 BASIC-Plus at the time, I wrote the analysis program using that language. As we began to look at some of the histograms and talk about what they meant, we came to the conclusion that we absolutely did not want to do a separate system call to specify the split header for each received or transmitted frame. So we chose to use a Special Function call to specify two header areas for transmit and two for receive on each of the eight NQ units. The headers are described by their location and size. After this semi-static definition, the "block" argument for the transmit and receive Special Functions are used to select one of the two split headers or no split header. Having made this decision, we had no real reason to continue further study of the timing data, but I was interested in some of the features that appeared in the histograms. So I did what turned out to be quite a bit of additional work to try to better understand what we had seen. II. What we did-- The time measurements were made on the 2 Q-bus processors that we use regular- ly, an 11/23 and an 11/73. Each test was done under RT-11/SJ, RT-11/FB, RT- 11/XM and TSX-Plus. Under RT-11/FB, RT-11/XM and TSX-Plus the tests were also run with more than one program in memory. We ended up with 10 test configura- tions that were run on each of the two processors. The configurations were: code configuration ---- ------------- 1. sj RT-11/SJ 2. fb RT-11/FB 3. f1 RT-11/FB with VDT in the foreground 4. f2 RT-11/FB with VTCOM in the foreground 5. xm RT-11/XM 6. x1 RT-11/XM with VDT in the foreground 7. x2 RT-11/XM with VTCOM in the foreground 8. t1 TSX-Plus V6.2 with one job 9. t4 TSX-Plus V6.2 with WINPRT detached and 2 jobs in KMON 10. t5 TSX-Plus V6.2 with WINPRT detached and 3 jobs in KMON Only part of the "f2" data was acquired. With the "f2" configuration, there was not enough memory for the program, VTCOM and the USR, so on one of the tests, the USR began swapping in and out for every call; these two tests were terminated and the data files were discarded. I reviewed all of the RT-11 system calls and selected a small set to time. I specifically avoided timing I/O on any mass-storage device (since I was trying to time the system, not the disk), but did expand the read timing tests on NL: to include VM: (RAM disk in main memory) and QM: (RAM disk on a separate board that we have been developing). RT-11/TSX-Plus System Timing, Spring 87 DECUS Page 3 The system calls selected were: Request Switch Comments ------- ------ -------- .DATE /D .GVAL /G .RCTRLO /O .TRPSET /T #0 (reset the trap handler) .CSTAT /C Do with open channel .READW /R Only to NL:, VM: or QM: .WAIT /W Do with open (idle) channel .LOOKUP /L Do with a single file, .PURGE after .PURGE /P Done with channel purged already In addition, I timed a "nop" test which consisted of a "mov #arg,r0" and a "nop" to serve as a baseline for the timing values. (This test was selected with the /N switch). And, as I mentioned above, the READW was done using NL:, VM: and QM:. However, when starting the tests with QM: on the 11/73, after the tests with QM: on the 11/23 were finished, the results showed that the times for VM: and QM: were so similar that the QM: tests were not finished. With the two tests that were terminated due to USR swapping, the total number of tests ranged between 10 and 12 for each of the tested configurations and processors. For reference, two examples of the code sequences used are: .sbttl $nop - Point to read argument area and do a nop $nop:: 10$: mov #KW1M+KWext+KWGO,@#KWCSR ;Start the clock mov #rargs,r0 ;Get address of argument area .list meb ;[Expand so nothing hidden] nop ; and do a NOP .nlist meb ;[Return to normal listing] bis #KWmSt2,@#KWCSR ;Trigger ST2 to get count bic #KWGO,@#KWCSR ; and Stop the clock mov @#KWBPR,ClkCl ;Get low part of clock time mov ClkCl,(r5)+ ;Save this time interval dec tstcnt ;Done with all tests yet? bne 10$ ; No, go back for another return ;Yes, return to caller RT-11/TSX-Plus System Timing, Spring 87 DECUS Page 4 .sbttl $date - Get the system date word $date:: 10$: mov #KW1M+KWext+KWGO,@#KWCSR ;Start the clock .list meb ;[Expand so nothing hidden] .date ;Get the system date word .nlist meb ;[Return to normal listing] bis #KWmSt2,@#KWCSR ;Trigger ST2 to get count bic #KWGO,@#KWCSR ; and Stop the clock mov @#KWBPR,ClkCl ;Get low part of clock time mov ClkCl,(r5)+ ;Save this time interval dec tstcnt ;Done with all tests yet? bne 10$ ; No, go back for another return ;Yes, return to caller III. The data files-- Each data file is 41 (decimal) blocks long and consists of a one-block header and 40 blocks of 10,000 16-bit binary time values in microseconds. The header block has three variable-length lines of ASCII text, terminated by Carriage- Return/Line-Feed pairs, and at least 3 extra Carriage-Return/Line-Feed pairs; the remainder of the block is null-filled. The first line contains the pro- gram name, program version, date, time and operating system name. The second line is a copy of the CSI command line given to the timing program; it shows the output file name, input device (if specified) and switch specifying the operation to be timed. The third line is a comment line. The data files are named the following way: "pp" is the two digit processor type (23 or 73), "t" is the one character test code (switch code), "dv" is either null or the two-character name of the input device for the .READW tests and "cc" is the configuration code from above. These pieces are combined to form the data file or histogram file name -- pptdv.Dcc (Data file) or pptdv.Hcc (Histogram file). Note that none of the histogram files have been included; they can be regen- erated as needed. The timing data from the 10 to 12 tests, 10 configurations and 2 processors resulted in 231 data files. The exact complement of the data files is given in the two tables below. Each "X" represents a data file that is present in the data sets on this SIG tape. With 10,000 time values in each data file, these files contain over 2.3 million time samples. RT-11/TSX-Plus System Timing, Spring 87 DECUS Page 5 ** Summary of data files provided ** SJ FB F1 F2 XM X1 X2 T1 T4 T5 23N X X X X X X X X X X 23D X X X X X X X X X X 23G X X X X X X X X X X 23O X X X X X X X X X X 23T X X X X X X X X X X 23C X X X X X X X X X X 23RNL X X X X X X X X X X 23RVM X X X X X X X X X X 23RQM X X X X X X X X X X 23W X X X X X X X X X X 23L X X X X X X X X X 23P X X X X X X X X X X SJ FB F1 F2 XM X1 X2 T1 T4 T5 73N X X X X X X X X X X 73D X X X X X X X X X X 73G X X X X X X X X X X 73O X X X X X X X X X X 73T X X X X X X X X X X 73C X X X X X X X X X X 73RNL X X X X X X X X X X 73RVM X X X X X X X X X X 73RQM X X X 73W X X X X X X X X X X 73L X X X X X X X X X 73P X X X X X X X X X X IV. Cautions about the data-- 1. I am not a statistician; I am an engineer who has been programming computers in control and lab and data acquisition situations for about 20 years. The result is that I don't know how to be sure that 10,000 samples is enough to really characterize any of these tests. 2. The KWV11-C equivalent programmable real-time clock was running with a 1 MHz time base. This means that the clock resolution is 1 micro- second and that the maximum synchronization error is also about 1 microsecond for starting and for reading the clock. 3. The memory boards on each of the systems are not DEC memory and I did not have a way to try different memory boards to see how that might affect the timings. The 11/73 has a 1Mb National Semiconductor memory and the 11/23 has a 256Kb National Semiconductor memory. 4. The versions of RT-11 run on the two systems were not the same. The 11/73 runs TSX-Plus all of the time and has not been upgraded from RT-11 V5.1C yet. The 11/23 was running RT-11 V5.2.