DBGen Synthetic Data Generator for SQL Tables and Text files on Windows Platforms

1999Ó

 Jim Gray (mailto:Gray@microsoft.com),

Andrey Barkhatov (barhat@obninsk.su)

December 1999 release --  Version 2

/DBGen/

 

These programs are provided "as is", without any express or implied warranty.  

 

Microsoft Corporation kindly hosts this web site, but Microsoft does not endorse or support the software or documentation.

 

The program is freeware.  It is freely distributed subject to the restriction that this HTML readme file be included with any distribution (so that we can continue to disclaim responsibility for any problems that may arise from it).

 

DBGen generates synthetic test data of many datatypes and distributions as a comma separated text file or SQL table via ODBC(v3.0). 

It was developed and tested with Microsoft SQLServer on Win32 platforms.  The program implements many of the algorithms proposed in:

 

Quickly Generating Billion-Record Synthetic Databases, Jim Gray,  Prakash Sundaresan, Susanne Englert, Ken Baclawski, Peter J. Weinberger, SIGMOD Conference 1994: 243-252.  An earlier version of the paper is at the web site SyntheticDataGen.doc. (200 KB)

or SyntheticDataGen.pdf.  (100KB)

A Zip file containing the source code, a Microsoft VisualStudio 6.0 project, and all the documents on this web site is at DBGen.Zip. (170KB).

 

Data types supported are

Char (length)

Varchar (length)

Decimal (precision, scale),

Datetime (mm/dd/yyyy hh:mmAM),

Tinyint (1byte),

Smallint (2byte),

Int (4byte)

Additionally for ODBC targets

Real (4byte) and

Float (8byte)

 

Distributions supported are:

                Constant

Ordinal

Random

Normal (Gauss)

Exponential,

Poisson,

Self-similar  (Zipfian)

Distributions can optionally be unique, and for ODBC destinations the field can be a key.

 

Min and Max values can be specified for all distributions in the natural format (e.g. number, string, or date).  To specify constant value use ordinal distribution with Min=Max.  When Min and Max are empty, default values are assigned (a very small min and a very large max).

 

For Random, Gauss and exponential distributions, the min and max values define the mean and standard deviation.  The Mean Value, and standard deviation parameters make sense for Normal, Poisson, and Self-similar distributions.  Standard deviations are specified as numbers.  Zipfian and Self-Similar standard deviations are values within (0-1).  Poisson standard deviations should be small compared to max-min.

Datetime is in minutes from 1753. Leap years are not observed (not Y2K compliant.). When varying Character strings only vary in the first four characters, the rest of the string is "blah-blah..".

 

The Galois group of integers under multiplication is used for random number generator, which gives unique pseudo-random values throughout sequence.  The values are then converted to reals, strings, or dates as needed -- still preserving uniqueness.

 

DBGen has a graphical user interface that should be fairly intuitive.  It also has a command line interface.

The default database name is "tempdb" and the default server name is "(local)".

 

To edit values in the sheet click the proper field or select and press Enter. There are pulldown menus hiding under many fields.   ("Key" and "Unique" fields are only clickable radio buttons).  To move between fields in the sheet use tab or next-field keys.  An active edit control must be closed (hitting Enter) prior to editing next field (yes, we know this is a pain).  To add or remove output- table columns, change Field_Number and hit Tab or enter.   The Record Count field tells the system how many records to generate.  The generator can write to either or both a file and an ODBC destination.   ODBC 3.0 is the default, select  the "Old_Odbc" optionon to access the ODBC 2.0 driver.

 

Generation settings can be loaded from a file specified in the command line (for example DBGen.exe config.txt).  Here is a sample Config.txt from the web site.

The syntax is Token=Value separated by  a new line from the next token,value paper.

In field definitions, "Field_Name" should be set first, "Type” second, Distribution parameters should follow the "Distr" token.  Reading the example is probably the simplest thing.

 

The ODBC driver must to install to execute the program.   Note, that many commonly used packages, such as MSOffice, already install the necessary DLLs, so you should not do it once more.

 

When generating, DBGen generates one beep for success, two for failure: not exactly EOC compliant.

Hints and error messages print in the status bar.

 

The code can be compiled with the defines 
    WIN32, 
    _CONSOLE, 
    NO_ODBC.
For example, the "-D_CONSOLE -DNO_ODBC" options  give a console-mode app with text file (no ODBC) outputs. 
The "-DWIN32" option inhibits the windowing interface.

 

Feedback is welcome.
Audrey wrote this code with some (little) consulting from Jim.


To download the WinZip file (documentation and a C++ project) ask for DBGen.zip (170KB)