Sort
FAQ (3/18/2003, new things in GREEN) (mostly PennySort
stuff)
Are there any requirements that
are considered 'common knowledge' that may surprise those novices?
There is a "reasonable OS"
requirement, since we are trying to test the IO subsystem Of the OS. If you
think the OS should do it, then it should. We are not OS guys, we are OS
users. So, it should do what you would expect the OS to do if you were a
DB guy or a graphics guy or a word processing guy.
Will Linux be allowed to be used
for the competition?
Sure! You can use any
hardware/software you want.
Can we write our own file
system? Can we define partitions, striping, etc. as we see fit?
This is a test of the OS file
and IO system and the underlying hardwared. If the OS or
its extensions support partitioning among disks or anything else, cool!!!
Can we fix bugs in the OS or do we have to
use available patches or service packs?
Yes, you can fix bugs, just make sure that the IO library still works as
advertised.
You are welcome to
use patches as well
Can
we use assembler?
Yes, but it is a bit
surprising if you need to. This is a test of the OS IO system not of the
compilers.
Can
we over-clock the CPU?
You should just use the system as priced. If you need a faster cpu, buy it. This is a software test, not a hardware test. Money is the constraint. Over-clocking warps the money constraint. So, please just use the off-the-shelf devices.
Can we use raw devices for temporary space?
No. This is a test of the file system, OS and hardware.
How can
I get the original data file?
Every entrant should use the same data
generator SortGen.zip.
You may download it at: http://research.microsoft.com/barc/SortBenchmark/SortGen.zip
(SortGen.C is the source, and SortGen.exe
is the Windows/Intel binary and it uses stdio.h in C.)
Usage:
SortGen #Records FileName
(Each record is 100 bytes long.)
Must the output of SORTGEN be
left exactly as it is generated?
To stand on the
common ground for all entrants, you shouldn¡¯t modify the data file generated by
SortGen to get a higher performance.
Can the data be
compressed on disk.
No. The data should not be compressed; we
want 100 MB of data to generate 100 MB of IO.
Can we control the placement
of the file on the drive surface?
Yes, you can control
placement (e.g. outer bands on the disk).
You can place the file any way you like. I particular, you will probably want to
defragment the disk and also place the file on the outer bands on the disk.
Can we store each record as a
file instead?
The requirement is
that STDIO or the OS equivalent of it can read the input and output file as a
single file.
Since the last 80 bytes of
each 100-byte record generated by SortGen.exe can be derived from the first
20 bytes, can we sort the first 20 bytes and compute the rest to form the result
file?
No, you must move all the bytes, since this
is an IO benchmark. You must sort all of
the data -- that's what the contest is all about. :-) Internally, in RAM
memory, you are allowed to use pointers
to the data and then write out the data in order of the pointers.
What order should the sorted file apply?
There are three popular orders:
(a)
The ASCII binary order strncmp
(
):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ [To get this order, simply use
strcmp() and memcmp().]
(b)
The case-sensitive order used by NTSort
(the SORT command in Windows NT/2000/XP):
'-
!"#$%&()*,./:;?@[\]^_`{|}~+<=>0123456789AabBcCdDEeFfGghHIijJkKlLmMnNoOPpQqrRsStTuUvVwWxXYyZz
(c)
The strnicmp
(
)order
(NTSort default locale with "C", by using
"sort /l "C" source /o dest"):
!"#$%&'()*+,-./0123456789:;<=>?@[\]^_`AaBbCcdDeEFfGgHhiIjJKkLlMmnNoOpPqQrRSstTUuVvwWXxYyZz{|}~
[To get this order, use stricmp() or strnicmp().]
And then there are the
various Unicode and code page orders.
For the competition you can use any order you want ¨C in particular the
binary order (a or c) is typically fastest. In the standard C library this is strncmp
(
) or strnicmp
(). The validator uses and encourages the
use of strnicmp
(
).
The
output data
What
is the required output format?
Output file is a
sorted permutation of the input file (same checksum). It should be a sequential
file, as seen by fopen
(
) and a linear seek or whatever other file
manipulation verbs are native to your operating system..
Should the output file be
sequential sectors on the surface of a disk?
It needn¡¯t to be.
What we need is a sequential file that can be opened by the file system.
How can I deal with the
temporary files? Must I delete them? What about the format on striped disk systems?
File space must be zeroed or the OS must
have some other strategy to prevent unprivileged users from seeing your files
after you delete them. This is a standard OS security issue. There
are ways to avoid zeroing. Indeed this was a big performance improvement
in some operating systems.
Can we leave the sorted file
in two disks? I mean, each disk will have only 1/2 of the result, and only if I
simply concatenate them together, it will be the final result.
No, input and output has to look like a
single file (this is a file system test), so if you want to distribute the data
file on different disks, you need use the OS RAID software.
Can the sort output destroy (reuse) the sort input file?
Yes and no. The sort must create a new output file and
allocate space for it. But, it can delete the old file and reclaim that space
if that is needed to run the task.
That is at the end of phase 1 (when the runs have been generated) the
program could delete the input file and then create the output file and merge
the runs into that new file.
How can we verify
the correctness of the results?
The program ChkSort,
a computes testing checksum and order of records, and include it in the
website. You may download it at: http://research.microsoft.com/barc/SortBenchmark/ChkSort.zip. It uses strnicmp
(
), if you observe a different
sort order you may have to modify it.
How can we calculate the PennySort price and
performance?
1)
Assuming your computer has a lifetime of three years,
which is T = 94,608,000 seconds, divide T by your system cost (in pennies) to
get a time budget for one penny. This is the capital cost of that amount of cpu time. It does not include electricity, or people or rent
on the space or air conditioning and the many other costs. So it is optimistic.
2)
Sort the maximum amount of data in your time budget.
How is time measured?
Wall clock time from the submission of the
sort to the response from the sort after it closes (and flushes) the output
file.
How is system cost measured?
PennySort
has a "shopping" component that
makes it very hard to compare results. We are trying to capture price-performance
without turning this into a shopping expedition. For 2003 we propose the
following rules: One can buy parts from any sources, but
all the parts should be listed on only one of the famous websites (such as
ussa.com, compusa.com,
techdepot.com, or spartantech.com,). Pricing should be based on a single website
and checked by the auditor ¨C all prices in dollars. Tax, shipping are not included but each entrant should add an assembly fee
($35) for each box. The price of the monitor, keyboard, and
mouse are excluded from the cost but the system should have memory and
persistent storage, processor, networking (100Mbps or better), and cabinet
(fans, power, ¡).
Is software cost included in
the system cost?
This is a major
change. We don't include the
development costs of the competition software, so we have decided not to price
the sort software or the OS software as part of the system price. That makes
the system price just the hardware price.
What should be submitted by April
1st, source code or binary code?
The auditor will check the code and
pricing, so all that needs to be submitted to the contest is the report
describing the work and the system and the auditor¡¯s endorsement?
Whom should our results be submitted to?
Submit report to Jim Gray ¨C
Gray@microsoft.com.
How can we get our report audited?
The basic rule is
that you need to find an auditor. Any
past winner of any sort contest is an accredited auditor.
If none of the previous winners is available, please get
them certified by a faculty member at your university.
Is there any trophy for the
winners?
Yes. Winners get a glitzy trophy, including
a cup for your team and a medal for each team member. You can keep the cup
until the next winner of the next year comes out, and you can keep the medal
forever as a lifelong souvenir.

The trophy
What else will we get from
joining the competition?
You will have a chance to set a new world
record, which evaluates the rapid progress has been making on hardware and
software.
If you win, your results will be listed on
the Sort Benchmark
homepage, and your report will be a good reference to the researchers who
interest in this field in the world. If so, you¡¯ll be proud of your
accomplishment, since it is rare in one's life to be the best-in-the-world in
something, and to be recognized for it.
When and where are the trophies
awarded?
Trophies are awarded
each year at ACM SIGMOD at an informal ceremony if the winners are present. Awards happen at the conference, but it is not sponsored by
SIGMOD. This is an informal thing.
It is also a
convenient opportunity for the winners to buy dinner for the previous winners
(an informal tradition.) If the
winners cannot be present then the trophies are mailed to them. It is quite possible to move the
¡°ceremony¡± to the host country (for example the photo below).

Jim Gray was awarding the 2002 PennySort (Daytona category) winner, the Tsinghua Team from