Tip: Storing MD5 Values (and other string/binary representations)

Tip: Storing MD5 Values (and other string/binary representations)

A common occurrence I have noticed in MySQL apps is that MD5 values are stored as 32 byte values rather than 16. Just to ‘rehash’, an MD5 value is a 16 byte hexadecimal value, typically used as a unique fixed-length signature of a string, useful for identifying unique strings or one-way encryption of passwords. The binary representation takes 16 bytes, though a human readable hexadecimal version takes twice as many.

The same goes for any of the other hashing techniques. They tend to output a friendly hex format, useful in a number of cases like in Javascript or within a particular format such as CSV or TSV (the random binary bytes would mess up the delimiting of data). When you’re looking to store these values though, most of the time it makes sense to have them in their shorter binary representation.

Another common example is IP addresses, I often see VARCHAR(16) for IPv4 addresses. Perhaps when IPv6 is more commonplace we will see VARCHAR(64) instead. IPv4 addresses are 32-bit values and can be stored as an UNSIGNED INT (4 bytes), while IPv6 addresses are 128-bit. There isn’t a native 16-byte integer type in MySQL so a BINARY(16) or two UNSIGNED BIGINT fields would do, though perhaps software will address this as IPv6 gains adoption.

When doing lookups on these kinds of fields, you want them as small as possible so that they can fit neatly into indexes and less processing time is spent evaluating them.

The following is a simple test to compare speeds of a CHAR(32) MD5 column versus a BINARY(16)

CREATE TABLE IF NOT EXISTS `md5values16` (`hash` BINARY(16) NOT NULL) ENGINE=myisam;
CREATE TABLE IF NOT EXISTS `md5values32` (`hash` CHAR(32) NOT NULL) ENGINE=myisam;

The MD5 values that are inserted are deliberately left-padded with 0’s to emphasise the fact that field lengths do make a difference when searching on a field, regardless of whether the field is indexed or not. This is because we’re only populating the table with ~2^20 rows, whereas random MD5s have 2^128 possible values. If we just used random MD5s then MySQL would only have to examine the 1st byte or two due of our small dataset and there would be negligible difference in our small sample. Over millions of runs, or a larger dataset… the difference grows.

$db = new mysqli('localhost','root','root','test'); // Change these credentials to your own
	
    if($db->connect_errno)
        die("Failed to connect to MySQL: (" . $db->connect_errno . ") " . $db->connect_error);    

// Empty the tables
$db->query('TRUNCATE TABLE `md5values16`; ') or die($db->error);
$db->query('TRUNCATE TABLE `md5values32`; ') or die($db->error);

// Padding used for string and binary representation
$chr = chr(0);
$bin = str_pad($chr,12,$chr);
$str = str_pad("0",24,"0");

// Insert values 10k a time
echo "Populating tables...\n";
for($i = 1;$i < 1000000;$i+= 10000) {
    echo $i,' ';
    $array = array(array(),array());
        for($j = $i;$j < $i + 10000;$j++) {
            $array[0][] = '(\''.$str.array_shift(unpack('H*',pack('N',$i+$j))).'\')';
            $array[1][] = '(\''.$db->real_escape_string($bin.pack('N',$i+$j)).'\')';
        }
    $db->query('INSERT INTO md5values32 VALUES '.implode(',',$array[0])) or die($db->error);
    $db->query('INSERT INTO md5values16 VALUES '.implode(',',$array[1])) or die($db->error);
}

function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}


echo "\nComparing speed for hits...\n";
 
$time_start = microtime_float();
    for($i = 2;$i < 202;$i++) 
        $db->query('SELECT hash FROM md5values16 WHERE hash = \''.$db->real_escape_string($bin.pack('N',$i)).'\'') or die($db->error);
echo (microtime_float() - $time_start)." seconds for 200 16 byte column hits\n";


$time_start = microtime_float();
    for($i = 2;$i < 202;$i++) 
        $result = $db->query('SELECT hash FROM md5values32 WHERE hash = \''.$str.array_shift(unpack('H*',pack('N',$i))).'\'');
echo (microtime_float() - $time_start)." seconds for 200 32 byte column hits\n";


echo "\nComparing speed for misses...\n";
 
$time_start = microtime_float();
    for($i = 0;$i < 200;$i++)
        $db->query('SELECT hash FROM md5values16 WHERE hash = UNHEX(MD5(\''.mt_rand(1,10000000).'\'))');
echo (microtime_float() - $time_start)." seconds for 200 16 byte column misses\n";


$time_start = microtime_float();
    for($i = 0;$i < 200;$i++) 
        $db->query('SELECT hash FROM md5values32 WHERE hash = MD5(\''.mt_rand(1,10000000).'\')');
echo (microtime_float() - $time_start)." seconds for 200 32 byte column misses\n";

Output may be similar to

Populating tables...
1 10001 20001 30001 40001 50001 60001 70001 80001 90001 100001 110001 120001 130001 140001 150001 160001 170001 180001 190001 200001 210001 220001 230001 240001 250001 260001 270001 280001 290001 300001 310001 320001 330001 340001 350001 360001 370001 380001 390001 400001 410001 420001 430001 440001 450001 460001 470001 480001 490001 500001 510001 520001 530001 540001 550001 560001 570001 580001 590001 600001 610001 620001 630001 640001 650001 660001 670001 680001 690001 700001 710001 720001 730001 740001 750001 760001 770001 780001 790001 800001 810001 820001 830001 840001 850001 860001 870001 880001 890001 900001 910001 920001 930001 940001 950001 960001 970001 980001 990001 
Comparing speed for hits...
10.276722192764 seconds for 16 byte column
16.271685123444 seconds for 32 byte column

Comparing speed for misses...
10.543118953705 seconds for 16 byte column
10.806349039078 seconds for 32 byte column

whoami
Stefan Pejcic
Join the discussion

I enjoy constructive responses and professional comments to my posts, and invite anyone to comment or link to my site.