DBGen Synthetic Data Generator for SQL Tables and Text files on Windows Platforms
1999Ó
Jim Gray (mailto:Gray@microsoft.com),
Andrey Barkhatov (barhat@obninsk.su)
December 1999 release -- Version 2
These
programs are provided "as is", without any express or implied
warranty.
Microsoft
Corporation kindly hosts this web site, but Microsoft does not endorse or
support the software or documentation.
The
program is freeware. It is freely
distributed subject to the restriction that this HTML readme file be included
with any distribution (so that we can continue to disclaim responsibility for
any problems that may arise from it).
DBGen generates synthetic test data of many datatypes and distributions as a comma separated text file or SQL table via ODBC(v3.0).
It was developed and tested with Microsoft SQLServer on Win32 platforms. The program implements many of the algorithms proposed in:
Quickly Generating Billion-Record Synthetic Databases, Jim Gray, Prakash Sundaresan, Susanne Englert, Ken Baclawski, Peter J. Weinberger, SIGMOD Conference 1994: 243-252. An earlier version of the paper is at the web site SyntheticDataGen.doc. (200 KB)
or SyntheticDataGen.pdf. (100KB)
A Zip file containing the source code, a Microsoft VisualStudio 6.0 project, and all the documents on this web site is at DBGen.Zip. (170KB).
Data types supported are
Char (length)
Varchar (length)
Decimal (precision, scale),
Datetime (mm/dd/yyyy hh:mmAM),
Tinyint (1byte),
Smallint (2byte),
Int (4byte)
Additionally for ODBC targets
Real (4byte) and
Float (8byte)
Distributions supported are:
Constant
Ordinal
Random
Normal (Gauss)
Exponential,
Poisson,
Self-similar (Zipfian)
Distributions can optionally be unique, and for ODBC destinations the field can be a key.
Min and Max values can be specified for all distributions in the natural format (e.g. number, string, or date). To specify constant value use ordinal distribution with Min=Max. When Min and Max are empty, default values are assigned (a very small min and a very large max).
For Random, Gauss and exponential distributions, the min and max values define the mean and standard deviation. The Mean Value, and standard deviation parameters make sense for Normal, Poisson, and Self-similar distributions. Standard deviations are specified as numbers. Zipfian and Self-Similar standard deviations are values within (0-1). Poisson standard deviations should be small compared to max-min.
Datetime is in minutes from 1753. Leap years are not observed (not Y2K compliant.). When varying Character strings only vary in the first four characters, the rest of the string is "blah-blah..".
The Galois group of integers under multiplication is used for random number generator, which gives unique pseudo-random values throughout sequence. The values are then converted to reals, strings, or dates as needed -- still preserving uniqueness.
DBGen has a graphical user interface that should be fairly intuitive. It also has a command line interface.
The default database name is "tempdb" and the default server name is "(local)".
To edit values in the sheet click the proper field or select and press Enter. There are pulldown menus hiding under many fields. ("Key" and "Unique" fields are only clickable radio buttons). To move between fields in the sheet use tab or next-field keys. An active edit control must be closed (hitting Enter) prior to editing next field (yes, we know this is a pain). To add or remove output- table columns, change Field_Number and hit Tab or enter. The Record Count field tells the system how many records to generate. The generator can write to either or both a file and an ODBC destination. ODBC 3.0 is the default, select the "Old_Odbc" optionon to access the ODBC 2.0 driver.
Generation settings can be loaded from a file specified in the command line (for example DBGen.exe config.txt). Here is a sample Config.txt from the web site.
The syntax is Token=Value separated by a new line from the next token,value paper.
In field definitions, "Field_Name" should be set first, "Type” second, Distribution parameters should follow the "Distr" token. Reading the example is probably the simplest thing.
The ODBC driver must to install to execute the program. Note, that many commonly used packages, such as MSOffice, already install the necessary DLLs, so you should not do it once more.
When generating, DBGen generates one beep for success, two for failure: not exactly EOC compliant.
Hints and error messages print in the status bar.
The code can be compiled with the defines
WIN32,
_CONSOLE,
NO_ODBC.
For example, the "-D_CONSOLE -DNO_ODBC" options give a console-mode app with text file (no ODBC) outputs.
The "-DWIN32" option inhibits the windowing interface.
Feedback is welcome.
Audrey wrote this code with some (little) consulting from Jim.
To download the WinZip file (documentation and a C++ project) ask for DBGen.zip
(170KB)