Sort FAQ (3/18/2003, new things in GREEN) (mostly PennySort stuff)

 

Common knowledge. 1

The input data. 1

The sorting process. 2

The output data. 2

Measuring the Sort 3

Report submission. 4

Award. 4

 

Common knowledge

Are there any requirements that are considered 'common knowledge' that may surprise those novices?

There is a "reasonable OS" requirement, since we are trying to test the IO subsystem Of the OS. If you think the OS should do it, then it should. We are not OS guys, we are OS users.  So, it should do what you would expect the OS to do if you were a DB guy or a graphics guy or a word processing guy.

 

Will Linux be allowed to be used for the competition?

Sure!  You can use any hardware/software you want.

 

Can we write our own file system? Can we define partitions, striping, etc. as we see fit?
This is a test of the OS file and IO system and the underlying hardwared. If the OS or its extensions support partitioning among disks or anything else, cool!!!

 

Can we fix bugs in the OS or do we have to use available patches or service packs?

Yes, you can fix bugs, just make sure that the IO library still works as advertised.
You are welcome to use patches as well

 

Can we use assembler?

Yes, but it is a bit surprising if you need to.  This is a test of the OS IO system not of the compilers.

 

Can we over-clock the CPU?

You should just use the system as priced.  If you need a faster cpu, buy it.  This is a software test, not a hardware test.  Money is the constraint.    Over-clocking warps the money constraint.  So, please just use the off-the-shelf devices.

 

Can we use raw devices for temporary space?

No.  This is a test of the file system, OS and hardware.

The input data

How can I get the original data file?

Every entrant should use the same data generator SortGen.zip.

You may download it at: http://research.microsoft.com/barc/SortBenchmark/SortGen.zip

(SortGen.C is the source, and SortGen.exe is the Windows/Intel binary and it uses stdio.h in C.)

Usage:  SortGen  #Records  FileName

(Each record is 100 bytes long.)

Must the output of SORTGEN be left exactly as it is generated?

To stand on the common ground for all entrants, you shouldn¡¯t modify the data file generated by SortGen to get a higher performance.

 

Can the data be compressed on disk.
No.  The data should not be compressed; we want 100 MB of data to generate 100 MB of IO.

 

Can we control the placement of the file on the drive surface?

Yes, you can control placement (e.g. outer bands on the disk).  You can place the file any way you like.  I particular, you will probably want to defragment the disk and also place the file on the outer bands on the disk.

 

Can we store each record as a file instead?

The requirement is that STDIO or the OS equivalent of it can read the input and output file as a single file.

The sorting process

Since the last 80 bytes of each 100-byte record generated by SortGen.exe can be derived from the first 20 bytes, can we sort the first 20 bytes and compute the rest to form the result file?

No, you must move all the bytes, since this is an IO benchmark.  You must sort all of the data -- that's what the contest is all about. :-)  Internally, in RAM memory, you are allowed to use pointers to the data and then write out the data in order of the pointers.

 

What order should the sorted file apply? 

There are three popular orders:

(a)     The ASCII binary order strncmp ( ):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ [To get this order, simply use strcmp() and memcmp().]

(b)     The case-sensitive order used by NTSort (the SORT command in Windows NT/2000/XP):
'- !"#$%&()*,./:;?@[\]^_`{|}~+<=>0123456789AabBcCdDEeFfGghHIijJkKlLmMnNoOPpQqrRsStTuUvVwWxXYyZz

(c)     The strnicmp ( )order (NTSort default locale with "C", by using "sort /l "C" source /o dest"):
!"#$%&'()*+,-./0123456789:;<=>?@[\]^_`AaBbCcdDeEFfGgHhiIjJKkLlMmnNoOpPqQrRSstTUuVvwWXxYyZz{|}~ [To get this order, use stricmp() or strnicmp().]

And then there are the various Unicode and code page orders.

For the competition you can use any order you want ¨C in particular the binary order (a or c) is typically fastest.  In the standard C library this is strncmp ( ) or strnicmp ().  The validator uses and encourages the use of strnicmp ( ).

 

The output data
What is the required output format?

Output file is a sorted permutation of the input file (same checksum). It should be a sequential file, as seen by fopen ( ) and a linear seek or whatever other file manipulation verbs are native to your operating system..

 

Should the output file be sequential sectors on the surface of a disk?

It needn¡¯t to be. What we need is a sequential file that can be opened by the file system.

 

How can I deal with the temporary files? Must I delete them? What about the format on striped disk systems?

File space must be zeroed or the OS must have some other strategy to prevent unprivileged users from seeing your files after you delete them.  This is a standard OS security issue.  There are ways to avoid zeroing.  Indeed this was a big performance improvement in some operating systems.

Can we leave the sorted file in two disks? I mean, each disk will have only 1/2 of the result, and only if I simply concatenate them together, it will be the final result.

No, input and output has to look like a single file (this is a file system test), so if you want to distribute the data file on different disks, you need use the OS RAID software.


Can the sort output destroy (reuse) the sort input file?

Yes and no.  The sort must create a new output file and allocate space for it. But, it can delete the old file and reclaim that space if that is needed to run the task.  That is at the end of phase 1 (when the runs have been generated) the program could delete the input file and then create the output file and merge the runs into that new file.

 

How can we verify the correctness of the results?

The program ChkSort, a computes testing checksum and order of records, and include it in the website. You may download it at: http://research.microsoft.com/barc/SortBenchmark/ChkSort.zip.  It uses strnicmp ( ), if you observe a different sort order you may have to modify it.

Measuring the Sort

How can we calculate the PennySort price and performance?

1)         Assuming your computer has a lifetime of three years, which is T = 94,608,000 seconds, divide T by your system cost (in pennies) to get a time budget for one penny. This is the capital cost of that amount of cpu time. It does not include electricity, or people or rent on the space or air conditioning and the many other costs.   So it is optimistic.

2)         Sort the maximum amount of data in your time budget. 

 

How is time measured?

Wall clock time from the submission of the sort to the response from the sort after it closes (and flushes) the output file.

 

How is system cost measured?

PennySort has a "shopping" component that makes it very hard to compare results. We are trying to capture price-performance without turning this into a shopping expedition. For 2003 we propose the following rules: One can buy parts from any sources, but all the parts should be listed on only one of the famous websites (such as ussa.com, compusa.com, techdepot.com, or spartantech.com,). Pricing should be based on a single website and checked by the auditor ¨C all prices in dollars.  Tax, shipping are not included but each entrant should add an assembly fee ($35) for each box.   The price of the monitor, keyboard, and mouse are excluded from the cost but the system should have memory and persistent storage, processor, networking (100Mbps or better), and cabinet (fans, power, ¡­).

 

Is software cost included in the system cost?

This is a major change.  We don't include the development costs of the competition software, so we have decided not to price the sort software or the OS software as part of the system price. That makes the system price just the hardware price.

Report submission

What should be submitted by April 1st, source code or binary code?

The auditor will check the code and pricing, so all that needs to be submitted to the contest is the report describing the work and the system and the auditor¡¯s endorsement?

 

Whom should our results be submitted to?

Submit report to Jim Gray ¨C Gray@microsoft.com.

 

How can we get our report audited?

The basic rule is that you need to find an auditor.  Any past winner of any sort contest is an accredited auditor.  If none of the previous winners is available, please get them certified by a faculty member at your university.

Award

Is there any trophy for the winners?

Yes. Winners get a glitzy trophy, including a cup for your team and a medal for each team member. You can keep the cup until the next winner of the next year comes out, and you can keep the medal forever as a lifelong souvenir.

The trophy

 

What else will we get from joining the competition?

You will have a chance to set a new world record, which evaluates the rapid progress has been making on hardware and software.

If you win, your results will be listed on the Sort Benchmark homepage, and your report will be a good reference to the researchers who interest in this field in the world. If so, you¡¯ll be proud of your accomplishment, since it is rare in one's life to be the best-in-the-world in something, and to be recognized for it.

 

When and where are the trophies awarded?

Trophies are awarded each year at ACM SIGMOD at an informal ceremony if the winners are present.   Awards happen at the conference, but it is not sponsored by SIGMOD. This is an informal thing.

It is also a convenient opportunity for the winners to buy dinner for the previous winners (an informal tradition.)  If the winners cannot be present then the trophies are mailed to them.  It is quite possible to move the ¡°ceremony¡± to the host country (for example the photo below).

Jim Gray was awarding the 2002 PennySort (Daytona category) winner, the Tsinghua Team from China .