Integer PKs

The case for/against integer PKs in replication.

There have been numerous discussions on the subject of using integer versus GUID/UUID primary key values in your database. Some prefer to create the GUID/UUID on the client side, others would prefer to do it server side. These discussions have, for the most part, been motivated by the need for valid uniqueness across replicating server boundaries.

Some of the discussion is quite heated, which betrays the emotion some people attach to this subject. We seem to take ownership of our methods and defend them with a vengeance.

I would like to investigate here the numerical range offered by integers when used as primary keys, and perhaps add some perspective to these arguments.

This may also serve to justify the use of FBReplicator and its assumed model of integer-valued PKs.

Replication
Reasons for replicating

Ongoing Backup: There are many options available to a Firebird user for data backup: database shadows, scheduled backups (gbak), hardware alternatives (e.g. RAID) and also replication.

Distributed access to data: A more common reason for database replication is to distribute access to server-based resources – allowing local LAN groups to access a nearby server while that server exchanges data with other servers on the WAN to keep data live across the organization. Certainly, when replication is carried out for this purpose, the added benefit of backup is built in.

The effects of replication

Replication, in and of itself, does not increase the total number of records to be inserted into a table. In some cases it is a means of distributing the task of data input. In others it offers a business the opportunity to deliver equal levels of technology to geographically separated input stations. In only a very few cases will it deliver the possibility of increasing the overall size of the combined and expected databases, and in those cases the increase may only be marginal.

Replication is most often a matter of technological delivery; it is not necessarily a means of changing the underlying business model. If you are recording the number of property sales in a country, then deciding to replicate the servers which record this information does not increase the number of property sales which will take place in any one year. In terms of the technology of delivery, this number remains static.

If you are recording transactions in a department store (or the like) which has “unknown” growth potential, then this unknown growth potential will not be any less or more unknown just because you choose to include a replication process. Your expectations of total sales transactions should be planned under the guidance of a marketing department which would like to “sell multiple items to everyone in the world”. But there will undoubtedly be many factors which clearly put an upper limit on this “unknown”. See below.

The introduction of replication should be accompanied by a means of territorializing the data to be input. The data belongs to “this store” or “this branch” or “this state” etc. It’s not much good having duplicates of the same data across different servers. In fact, replication could be regarded as a means of reducing duplication since territorializing your data may reduce the total number of records across all data stores.

There is one significant disadvantage to the process of replication when we speak of the finite size of integers and the distribution of the record input process. All the schemes known to me (see below for a separate discussion) where integers are involved require that certain bands of values be assigned to each station. Under such a scheme, where some stations use many more values in their band than others, an artificial upper limit may be imposed on the heavy-traffic stations while the low-traffic stations end up monopolizing large bands with low utilization. There is one method, explained later, which minimizes this effect. In contrast, the use of GUID/UUIDs for PKs results in an effectively unlimited range of values but a more finite time span for their use. In theory the GUID/UUID solution is also finite in range, but in practice the range might as well be infinite.

Integers

Relatively recently we have moved into the world of 64bit integer computing. The change in range from 32 bit integer computing is at times incomprehensible.

Integer – 32 bit: value range -2,147,483,648 to 2,147,483,647, i.e. a total of 4,294,967,296 values

Integer – 64 bit: value range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, i.e. a total of 18,446,744,073,709,551,616 values

The nearest I can come to words for this is about 18 billion billion – 18 quintillion in the American terminology.

Some comparisons to consider:

  • There are more people in the world than values in the positive range of an Int32; the absolute range is of the same order as the world's population.
  • The Int64 is so large that if you divide it by the highest value of an Int32 you get roughly 4,294,967,296, and that's only counting the positive range. Put another way, the positive range of an Int64 alone holds 2 x Int32 x Int32 values (taking Int32 here as its positive range).
  • If, a few years ago, Int32 was large enough for you to be satisfied that your record numbers would be OK with an Int32 PK, then you could now set up 2,147,483,648 replication server stations, each with a PK range twice the size of your old Int32 range. And that's only in the positive range.
GUIDs

A UUID is 128 bits long and, if generated according to one of the standard mechanisms (as described in the UUID specification), is either guaranteed to be different from all other UUIDs/GUIDs generated until 3400 A.D., or extremely likely to be different (depending on the mechanism chosen). The time component of the UUID will roll over at that date.

It's such a shame to have a data type capable of delivering such an enormous total value range cut down to a mere 1,400 years. This is because we need such a huge base range to make the uniqueness guarantee when these IDs are generated on a distributed basis.

Some real life examples:

At various average insert rates, the expected life span of a database table is shown below:

Insert rate (avg)       1 per 30 seconds              1 per second                   10 per second
Insert total            1 million new records/year    31 million new records/year    310 million new records/year
32-bit Positive Range   2,108 years                   68 years                       6.8 years
32-bit Absolute Range   4,216 years                   136 years                      13.6 years
64-bit Positive Range   9,000 billion years           297,528 million years          29,753 million years
64-bit Absolute Range   18,000 billion years          600 billion years              60 billion years
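For anyone who wants to check a cell of that table, the arithmetic is simply the key range divided by the annual insert volume. A quick, purely illustrative way to run the figures is against any Firebird database (the alias name is the only thing assumed here):

    /* Life span in years = key range / (inserts per second * seconds per year).
       Example: the 32-bit positive range at 1 insert per second. */
    SELECT 2147483648.00 / (1 * 31557600) AS years_32bit_positive  -- roughly 68 years
    FROM rdb$database;

Swap in the other ranges and rates to reproduce the remaining cells.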

Obviously we can't think of putting a seven-year life span on our software system or datastore, but let's think about this for a minute.

A single data source associated with an automated data logging system is one way to create a continually high rate of inserts. Often, however, such systems log only one or a few records per second. Some systems might be intended to scale much higher, e.g. POS systems where the number of stores or POS stations grows well past this rate. In such cases an insert rate of 30 per second (averaged over the year) could be possible, and even that would still give a life span of around 10,000 million years.

If we were to take the life span associated with GUID/UUID PKs and base our calculation on 1,400 years, then an Int64 PK system would provide for a total of approximately 209 million inserts per second (average) for that entire period.

Most real world systems suffer from the peak/trough cycle syndrome, and we really should be looking at the average rate and not the peak. After all we are trying to ascertain the total annual record growth not the fastest record growth. In this light, 310 million records per annum looks ridiculous to many of us. What size database would that generate? Let’s take a stab.

One database, one table, six columns, let's say 128 bytes per row: 310 million rows a year at 128 bytes each is roughly 40GB in one year. And that's a tiny database in terms of structure. What could you do with such a database relationally? It's just a flat file. Build some structure into it and you will need a lot of hard disk space in no time flat. There may only be a few of us who can claim to have databases this size at the moment.

Now let's put some replication stations into the equation. Let's take 10,000 replication stations based on either the GAP or the STATION NUMBER method (see below). Let's also assume that, due to duplication (failure to territorialize), or as part of the data model, a level of 30% duplication is reached. Perhaps the technology is delivering data capture to areas which were previously unable to contribute data, so the 30% increase is real rather than duplicate.

Insert rate (avg)       1.3 per 30 seconds             1.3 per second                 13 per second
Insert total            1.3 million new records/year   40 million new records/year    400 million new records/year
32-bit Positive Range   1,651 years                    54 years                       5.4 years
32-bit Absolute Range   3,302 years                    108 years                      10.8 years
64-bit Positive Range   7,000 billion years            230,584 million years          23,058 million years
64-bit Absolute Range   14,000 billion years           461 billion years              46 billion years
If I had said, let’s take 100,000 replication stations instead of 10,000, would these numbers have changed? Not necessarily. I’m open to hearing about the odd case where a model is clearly going to increase the load shown here, but I would have to argue that most models will not have this impact.

All you are doing by increasing the station numbers is further distributing the business model and making each station reduce the numbers of records they insert. Conversely, if you are thinking that an increase in station numbers will increase the total record numbers, then I would argue that your model of total record numbers is in error or just plain short sighted in the first place.

As mentioned above, let's take a department store which needs to create transaction records for each sale (perhaps multiple records per sale). You can dream of some enormous sales figure you'd love to show the Board but, ultimately, someone is going to do the math on your estimate: transactions per store, times the average number of stores per town, times the number of towns and cities able to support a store, over 24 hours per day and 365 days per year. You have to arrive at a number and then multiply it by the number of years the store will run your software. How long will it be before the store goes broke, or computers change so radically that they are forced to upgrade to 128-bit integer based machines? Let's put those transaction numbers up a thousand-fold, to 1,300 per second.

That means the life of your system is cut short – to somewhere around 225 million years, and that is using the positive range of an Int64 alone.

But if each of the 100,000 stations were using GUID/UUID based PKs, they would still be limited to only 1,400 years.

If, in reverse, we said we only want 1,400 years of life out of our integers, to match the life span of a GUID/UUID system, what would the table show then?

                        10,000 stations                100 stations                   one station
Insert life             1,400 years                    1,400 years                    1,400 years
32-bit Positive Range   0.049 inserts/sec (~3/min)     0.049 inserts/sec (~3/min)     0.049 inserts/sec (~3/min)
32-bit Absolute Range   0.097 inserts/sec (~6/min)     0.097 inserts/sec (~6/min)     0.097 inserts/sec (~6/min)
64-bit Positive Range   209 m inserts/sec              209 m inserts/sec              209 m inserts/sec
64-bit Absolute Range   418 m inserts/sec              418 m inserts/sec              418 m inserts/sec
Why are they all the same across the columns? Because the business model hasn't changed; only the technology has changed. If you think there is an impact on the business model from this delivery method, then you can adjust these figures to suit.

The Replication Method

There have been several methods suggested for setting up key generation across replicated databases:

GAPS, or generator start points: Each station is established with the generators for each table set to a value that is offset from the other stations, e.g. with two stations, one starts at 0 and the other starts at the bottom of the Int32 range (-2,147,483,648).
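A minimal sketch of that setup, assuming plain generators and a BEFORE INSERT trigger (object names are illustrative and no particular replication tool is implied):

    -- Station A (run against station A's database): start the generator at 0.
    CREATE GENERATOR gen_customer_id;
    SET GENERATOR gen_customer_id TO 0;

    -- Station B (run against station B's database): start at the bottom of the Int32 range.
    CREATE GENERATOR gen_customer_id;
    SET GENERATOR gen_customer_id TO -2147483648;

    -- On every station, keys are then assigned identically:
    SET TERM ^ ;
    CREATE TRIGGER bi_customer FOR customer
    ACTIVE BEFORE INSERT POSITION 0
    AS
    BEGIN
      IF (NEW.customer_id IS NULL) THEN
        NEW.customer_id = GEN_ID(gen_customer_id, 1);
    END^
    SET TERM ; ^

Each station then consumes its own band of the integer range without any coordination at insert time.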
Using a system I run as an example: instead of imagining a peak rate of inserts, I look at last year's activity in the most used/abused table and divide by 31.5 million seconds per year. In my case the bulk of the inserts – about 1.2 million records – arrive in a two-month peak period and the rest of the year is very quiet; the users are working on the items, scheduling them and so on. Inserts can peak at around 4 per second during that period, but the yearly total still calculates out to 1,700 years of inserts on one station with Int32 in the positive range, or 3,400 years if I use the negative range as well. If I set up gaps across the four stations and use the entire range I still get 850 years across the four stations. Now, if I use Int64 (9,223,372,036,854,775,807) – heck, that's a big number! I always have to reach for another calculator for this... anyway, you get the picture – you do the math.
Unfortunately, this system was built in 1997 and relies on the BDE. I have no chance of easily migrating it to Int64, or indeed any other means of delivering Int64 keys. The client will someday have to decide to retire the system or pay for a relatively major upgrade. I expect, however, that the BDE will be the cause of such a decision; it won't be that we have run out of integer values.

STATION NUMBERS: With this method you set up your generators to increment in steps of, say, 10,000, e.g. gen_id( gen_loc_id, 10000 ). On each station you seed the generator with that station's number: station one starts at 1, station two starts at 2, and so on, but they all jump by 10,000, so station 1 goes 10001, 20001, 30001 etc. and station 2 goes 10002, 20002, 30002 etc. You can then always tell the station of origin of a record from its key.
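A rough sketch of that scheme, keeping the gen_loc_id name from the example above (the orders table and order_id column are invented for illustration):

    -- Station 1: seed the generator with the station number
    -- (station 2 would be seeded with 2, and so on).
    CREATE GENERATOR gen_loc_id;
    SET GENERATOR gen_loc_id TO 1;

    -- Keys are then drawn in steps of 10,000, e.g. from a BEFORE INSERT trigger
    -- as in the GAP sketch above, so station 1 produces 10001, 20001, 30001 ...
    -- and station 2 produces 10002, 20002, 30002 ...
    --   NEW.order_id = GEN_ID(gen_loc_id, 10000);

    -- The station of origin can be read back from any key:
    SELECT MOD(order_id, 10000) AS station_no FROM orders;
    -- (MOD is built in from Firebird 2.1 onwards; older versions need a UDF or
    --  order_id - (order_id / 10000) * 10000, relying on integer division.)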

Now apply your expected yearly insert rate to it and you will find your life cycle. Using my system above, at my insert rate, I can change to this scheme, enlarge it to a 10,000-station capability and still get approximately 826 million years of operation – and that's only in the positive range! Go negative as well and you get about 1.7 billion years. What do you think?

COMPOSITE PKs: With this method the first field of a composite key is the generated integer value as normal, and the second field is the station number. When records from multiple stations are combined, the records retain their uniqueness even though the first field may contain many duplicates.

This solution may mean more work in the client application to make it function. The first two alternatives above require relatively little work in the client applications.
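As a rough illustration of why, here is what such a key might look like (table and column names are invented for the purpose):

    CREATE TABLE sales (
      sale_id     INTEGER  NOT NULL,   -- generated locally; duplicates allowed across stations
      station_id  SMALLINT NOT NULL,   -- identifies the originating station
      amount      NUMERIC(12,2),
      CONSTRAINT pk_sales PRIMARY KEY (sale_id, station_id)
    );
    -- Child tables must carry, and join on, both columns, which is where the
    -- extra client-side work tends to come from.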
UUID: This method requires at least that you use the reversed UUID variant of the value, so that the indexing is as efficient as possible (see footnote i).
You should also pay attention to the page size of the database, to ensure that your index pages are being used efficiently (see footnote ii).

You may also need to bear in mind that UUIDs do not offer a natural ordering by sequence of insertion. You will need a record timestamp to do that job. This is no big deal, since I always have DATECREATE and DATEUPDATE fields in my tables.
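Sketched very roughly (the column names follow my own habit, the CHAR(22) length follows the compressed UUID mentioned in footnote i, and the UDF that actually supplies the value is not shown):

    CREATE TABLE notes (
      note_id     CHAR(22) NOT NULL,                               -- supplied by a UUID UDF or the client
      body        VARCHAR(1000),
      datecreate  TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,    -- preserves insertion order
      dateupdate  TIMESTAMP,
      CONSTRAINT pk_notes PRIMARY KEY (note_id)
    );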

I hear what people say about using GUID/UUIDs but, under many circumstances, Int64 will outlive 1,400 years by a long shot.

Another limitation worth mentioning is the server itself, which has a final upper limit on insert rate with current hardware: CPU, hard disk speed and so on. I've seen figures but can't recall them at this point; I imagine that 1,000 inserts per second is about as good as you get using a pumping method from external tables. Human record creation is, of course, much slower. For systems which need to endure high rates at peak intervals, the replicator will need periods of low activity to catch up.

Footnotes
i Ian A. Newby wrote:
If anyone's interested, I've got a UUID UDF for Firebird. It works on both Linux and Windows, and can create both 38-char GUIDs and 22-char UUIDs (the 22-char version is a compressed GUID with the byte order reversed to make it index better). There are also functions to convert between the two forms.

ii David Johnson wrote:
Straight GUIDs (e.g. as generated by a Delphi-written application) have their significant part at the end, which does not provide for optimal indexing. Use of the UDF generating UUIDs (reversed GUIDs) is said to improve performance over the PK dramatically.

With small index column sizes you need much more detail than I have at my fingertips to compute a real value, but let's say you go to a 4-byte integer key. You lose 34 bytes from the index entry footprint, giving 50-34=16 bytes per entry plus some overheads. On an 8k page this gives you roughly 500 entries. At two levels in the B+ tree you can address 250,000 rows; at three levels you can address 125,000,000 rows.

If your table space is expected to grow to 4,000,000 rows then you still have to expand the index to three levels, but you can go a lot further on three levels than you can with a larger key.