The Usefulness of Bit Vectors in Data Encoding


On January 07, 2001, Elrac said:

Bit fields are a natural fit for representing sets. Sets are collections of things in which some element is either present, exactly once, or not at all. Bit fields are great for encoding things like True/False questionnaires.

My basic scenario here is from one a spy movie where an average Joe is recruited from the population of America because the CIA needs a person with a very specific set of talents for a mission, which involves impersonating a person with just these talents, or something similarly far-fetched. Of course the Census has gathered information on the skills of everyone in the nation for just such a purpose.

You'd have (say) 8 questions like:
  1. Can you handle hyperbolic trigonometry?
  2. Are you able to weave baskets underwater?
  3. Do you practice Applied Magic?
  4. Do you understand the Special Theory of Relativity?
  5. Can you ride a horse bareback?
  6. Do you speak fluent Esperanto?
  7. Can you change a light bulb with your hands tied behind your back?
  8. Have you navigated a nuclear submarine before?
A set of answers would look like this: Y, Y, Y, N, N, Y, Y, N.

If you wanted to store the answers in a database (for whatever purpose), you could set up 8 fields of 1 character. That's very straightforward but a bit wasteful. If you use single bit fields, you wouldn't even be wasting too much space, and you could query the answers flexibly. But for our spy mission, we aren't interested in anything except a perfect match on some subset of these 8 skills. And because the fate of the nation is in the balance, the query must be very quick. Ideally, a match should be determined by a simple and quick operation on a single field.

Take our answer set: Y, Y, Y, N, N, Y, Y, N. The answers are already in binary, so it would make sense to encode them in binary, as: 11100110. But of course a bit sequence like this looks a lot like an integer in binary. If you punch 11100110 into a calculator that "does" bases (Microsoft's Calculator applet does, in Scientific mode), you can convert it to 230 decimal.

Converting binary to decimal is pretty easy: Just assign a power of 2 to each bit. If I want things to work as they do in the Calculator, I have to assign 128 to the first question, 64 to the second, 32 to the 3rd and so on to 1 for the last one. Here's a table:
Q# Value
1 128
2 64
3 32
4 16
5 8
6 4
7 2
8 1

Time to step back and consider why one would want to DO such gyrations. Let me remind you of my premise/promise that a number of conditions could be tested with a single quick operation. Oops... Mission Control is calling. They need a bareback horse rider and nuclear submarine pilot who can change light bulbs with his hands tied behind his back. Let's see: That's questions 5, 7 and 8, or binary 00001011. The decimal representation is found by simply adding the decimal values of the bits, or 8 + 2 + 1 = 11. Now... let's write down the answer set from above, and our requirements set from just now, below each other:
11100110
00001011
In order for someone to qualify, he needs to have 1's at least in those places where the requirement has 1's. This is where the bitwise AND comes in. Many language have this operator because it's easy for a CPU to compute and it's often handy for stuff like this. SQL Server uses a '&', C uses '&&.' The bitwise AND returns a result where there are 1's in all those places where there are 1's in the same column in both operands. Let's work this out for our example:
  11100110
& 00001011
----------
= 00000010
OK, our result has a single bit set, it's the match on the light bulb changing. But there's no match on the other two qualities, so the result is mostly 0's. In decimal, this trial comes out as: 230 & 11 = 2. Enter Albert Schwarzenegger, who answers 'YES' to all the questions:
  11111111
& 00001011
----------
= 00001011
In decimal: 230 & 11 = 11. So... even the most qualified applicant never gets more bits set than there are in the requirement. After all, we don't care about the other bits. What this shows us is how to formulate the query for the requirement: In order for an applicant to meet ALL the requirements, (applicant & requirement = requirement).

This shows us what bit vectors are best at: Quick selections on a bunch of yes/no criteria, where ALL the required 'Yes' criteria need to be met for a match. A simple test made on a number stored in a single field tells us if this criterion is met or not.

This principle can be extended to test for 'no's by using bitwise OR ('|') on the operands, or inverting some or all of the bits of the operands using the Exclusive OR ('^') operator. Softer criteria can be obtained by combining several such queries with logical OR... but there comes a point where the queries are no longer more efficient than queries on other, more common forms of data representation. But for a reasonable number of Yes/No criteria (remember, an integer will only hold 32 bits!) or criteria with a small number of possible values (which again can be coded into a few yes/no bits), this is the most efficient method of storage and querying.
Back to Table Of Contents