URL shortener’s have proliferated in the past few years, mainly due to the confines of data length that mobile and social networks like Twitter apply. The following code example shows how to make a simple and efficient URL shortener, with plenty scope for improvement.
Although in this example I am going to use localhost as the serving domain, you’d be looking to use as short a domain name as possible in production. You have a fairly good chance of securing a 5 character domain that’d result in 7 character short URLs to begin with.
This example assumes you’re on a 64-bit system that uses a web server capable of rewriting, and uses PHP/MySQL. I use the hashing function murmurhash() as it’s known to be very quick and effective with short strings, though it’s not native to standard PHP installations. You can follow these instructions to install murmurhash. If you do change the hashing method, just ensure you also update the table schema in MySQL (one field is used for the hash values).
Also, I’ve used MySQL partitioning which makes lookups more efficient, but it’s not necessary for the working of the script.
The HTML layout of it is extremely simple, in fact you’ll only see a form and the display of newly created short URLs.
Other than displaying HTML contents, the concept is quite simple:
— Allow a user to submit a long URL and convert it into a short one
— Deal with requests for short URLs and redirect them to long ones.
This, essentially is what a URL shortener is. Some services will give you nice statistics and graphs about the people who visit a short URL. The code provided here is simple and extensible enough for you to do that should you wish.
There are 3 locations where data is stored in this script, 2 MySQL tables and one flatfile. One table is for inserts, one for selects and the flatfile contains the long URLs in the system.
INSERT TABLE
When new URLs are added, a quickish method is needed to see whether the URL already exists in the database or not. This is done by creating a 64-bit hash from the contents of the URL, though you can use whichever hashing method and size of data you wish. 8 bytes is a fairly good size for avoiding collisions while not being too large a key.
insert_table is partitioned (if you choose to) and contains three other fields…
— fileoffset – A pointer to a position in the flatfile that contains the URL in question. Since hashes can collide, all matching hash values are checked until the corresponding URL is found (or not found).
— urllength – Also part of the primary key, this is used to further reduce the potential result set in the case of collisions. Only more than one result will appear for URLs that match the hash and also match the urllength.
— id – The unique incremental ID of the URL, this is converted into short URLs. In cases where someone submits a URL to be shortened that already exists in the database, this datapoint is used.
After a long URL is submitted, the short version is returned to the end user that they can use.
SELECT TABLE
select_table simply holds the unique incremental ID of the URL and a fileoffset for where to find the long URL. It is used when someone load up a short URL and needs redirected to a long URL.
URL FILE
The URL file is simply a raw list of long URLs entered into the system
THE CODE
Without further ado, here’s the code in order to try it out.
First off, we want to redirect all requests that may be shortened URLs to our single script. Something like this in .htaccess does the trick.
RewriteEngine On
RewriteRule !(^shortener.php|/) /shortener.php [L]
This rule basically means “if the URL is not shortener.php and does not contain a forward slash, redirect to the URL shortener”. This will allow you to create extra pages on the domain that won’t be redirected, but they’ll need to have a forward slash included in the URL.
Now, add this SQL schema to a database of your choice.
CREATE TABLE `insert_table` (
`hashid` bigint(20) NOT NULL,
`fileoffset` bigint(20) unsigned NOT NULL,
`urllength` smallint(5) unsigned NOT NULL,
`id` int(10) unsigned NOT NULL,
PRIMARY KEY (`hashid`,`urllength`,`id`)
) ENGINE=InnoDB;
CREATE TABLE `select_table` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`fileoffset` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1;
Not required but recommended, also apply the following SQL. Partitions can be created during table creation but I’ve separated the two concepts here for clarity.
ALTER TABLE insert_table PARTITION BY KEY(hashid) PARTITIONS 32;
ALTER TABLE select_table PARTITION BY KEY(id) PARTITIONS 32;
Note that partitions are usually stored as separate files and mostly treated as separate tables. You can have up to 8192 partitions on more recent versions of MySQL, which take longer to create but certainly should help scaling. You may come into some issues with operating/user system open file limits if you put a higher number, though it is trivial to change after a quick Google.
In more recent versions of MySQL you can also reference particular partitions directory, which can help the query optimiser pick the correct/minimal partitions, particularly when using JOIN’s.
Here’s the PHP driving it all…
error_reporting(E_ALL);
ini_set('display_errors', 1);
class shortener {
var $domain = 'localhost'; // URL shortener domain, used to display shortened URLs and to prevent shortened URLs pointing back to this service
var $index = '0ABCDEFGHIJKLMNOPQRSTUVWXYZ123456789abcdefghijklmnopqrstuvwxyz$-_.+!*\'(),'; // Valid characters to use in shortened URLs
var $base = 73; // Number of characters used in $index
var $form = '<form method="post">URL: <input type="text" name="url" /><input type="submit" value="Shorten URL" /></form>'; // Form to submit long URLs
var $file = 'urls.txt'; // File containing
var $fp; // File handle for $file
var $db; // Database connection
/* Main construct, either:
1. - Show the form to enter long URLs to be converted to short URLs
2. - Deal with a submitted long URL
3. - Examine a short URL and redirect it to a longer URL
*/
function __construct() {
touch($this->file); // Remove me after first invocation
if(strlen($_SERVER['REQUEST_URI']) == 1) {
if(!isset($_POST['url']))
die($this->form);
else
$this->insert_request();
}
else
$this->select_request();
}
// Connect to database
private function db() {
$this->db = new mysqli('localhost','root','root','shortener'); // Change these credentials to your own
if($this->db->connect_errno)
die("Failed to connect to MySQL: (" . $this->db->connect_errno . ") " . $this->db->connect_error);
}
// 2. Deal with a submitted long URL to be converted to a short URL
private function insert_request() {
// Parse URL into its components
$p = parse_url($_POST['url']);
if(!isset($p['scheme'],$p['host']))
die('URL was invalid<hr>'.$this->form);
// Normalise
$p['host'] = strtolower($p['host']);
if(!isset($p['path']))
$p['path'] = '/';
if(isset($p['query']))
$p['query'] = '?'.$p['query'];
$p['scheme'] .= '://';
// Ensure its not a link back to this shortener that could cause an infinite loop
if($p['host'] == $this->domain || preg_match("'.".$this->domain."$'",$p['host']))
die('URL was invalid<hr>'.$this->form);
// Variables and resources needed for URL lookup
$url = implode('',$p);
$urllength = strlen($url);
$hash = (murmurhash($url) << 32) | murmurhash(substr($url,1));
$id = -1;
$this->fp = fopen($this->file, "r+");
$this->db();
// Query the table to see if the URL exists or needs to be inserted
$result = $this->db->query('SELECT fileoffset,id
FROM insert_table
WHERE hashid = '.$hash.' AND urllength = '.$urllength) or die($this->db->error);
while($row = $result->fetch_array(MYSQLI_ASSOC)) {
fseek($this->fp,$row['fileoffset'],SEEK_SET);
if(fread($this->fp,$urllength) == $url) {
$id = $row['id'];
break;
}
}
// New URL
if($id < 0) {
// Get an exclusive lock to the URL file list and write the new URL
fclose($this->fp);
$this->fp = fopen($this->file,'r+');
fseek($this->fp,0,SEEK_END);
flock($this->fp,LOCK_EX);
$ftell = ftell($this->fp);
fwrite($this->fp,$url."\n");
fclose($this->fp);
// Insert the fileoffset into the select lookup table and get the auto incrementid
$this->db->query('INSERT IGNORE select_table (fileoffset) VALUES ('.$ftell.')') or die($this->db->error);
$id = $this->db->insert_id;
// Write id to insert lookup select
$this->db->query('INSERT IGNORE insert_table (hashid,fileoffset,urllength,id) VALUES ('.$hash.','.$ftell.','.$urllength.','.$id.')') or die($this->db->error);
}
// Convert the ID to a tiny URL path
$tiny = '';
for($t = floor(log10($id) / log10($this->base)); $t >= 0; $t--) {
$a = floor($id / pow($this->base,$t));
$tiny .= $this->index[$a];
$id = $id-($a*pow($this->base,$t));
}
die('Your tiny URL is http://'.$this->domain.'/'.$tiny.'<hr>'.$this->form);
}
// 3. Takes a short URL and converts it to a long one
private function select_request() {
$rindex = array_flip(array_slice(preg_split("''",$this->index),1,-1));
$dec73 = substr($_SERVER['REQUEST_URI'],1);
$len = strlen($dec73);
$id = $rindex[$dec73[$len-1]];
$k = 1;
for($j = $len-2;$j >= 0;$j--) {
if(!isset($rindex[$dec73[$j]])) {
header('HTTP/1.1 400 Bad Request');
die('The requested URI does not conform to this URL shortener\'s syntax');
}
$id = bcadd($id,bcmul($rindex[$dec73[$j]],bcpow(73,$k)),0);
$k++;
}
$this->db();
$result = $this->db->query('SELECT fileoffset
FROM select_table
WHERE id = '.$id.'
LIMIT 1');
if(!$row = $result->fetch_array(MYSQLI_ASSOC)) {
header('HTTP/1.1 404 Not Found');
die('The requested URI does not exist on this server');
}
$this->fp = fopen($this->file,'r');
fseek($this->fp,$row['fileoffset']);
header('Location: '.trim(fgets($this->fp)));
fclose($this->fp);
exit(0);
}
}
$_s = new shortener;
Some Notes and Possible Improvements
— As is, the shortener is fine for personal use or use within a trusted network.
— You may want to log some details about the users submitted URLs, in case particular users become problematic by submitting garbage, filling up your database or link to material you believe should not be linked to.
— Along the same lines, you may want to rate limit new URL submissions per IP or per session cookie.
— Some basic statistics about the number of visitors to each URL may be useful.
— Along the same lines, you may want to build a memory cache based on popular URLs.
— Occasionally you will want to rebuild the partitions in the insert_table. This is because the hash values are inserted in a random order and gradually fragment the table. Having partitions helps you do this process incrementally and continue to be able to serve requests (you would need some kind of indication that a particular partition is ‘busy’ and copy the contents of the partition somewhere temporary in order to continue serving requests for data within it, until the partition is fully rebuilt).
— You could sacrifice some processing for more optimal storage of URLs by converting common components of a URL into an integer flag. For example, most if not all URLs would be either HTTP or HTTPS and that only requires 1 bit of information to distinguish them from each other. “www.” in hostnames is another common component. TLD’s are another, as are file extensions.
— Pre filling the URL file with a large amount of space would avoid fragmentation of the file due to the small incremental writes on it. You’d want to mark somewhere how much data is actually in the file as the script currently just seeks to the end of it to write new data. (The same idea could be applied to the MySQL insert_table)
— Having multiple disks (preferably SSD) would obviously help with an IO contention. Also having a more logical ordering of the URLs (by length for instance) could rid you of fileoffsets in the database altogether, because you’d only need to know the unique ID and the length of a short URL in order to find its longer counterpart. I deliberately made this example code simple in order to not have too many open file handles.