2024.01 Vol.3

Bit of number base representation

What Does Seven Look Like?

I got a little turned around thinking through how to convert a number from any base to any other base. And then I began wondering whether such algorithms have any performance impact, given that computers are implemented in base_2. I had to take a step back and remember what is going on.

First, a number written in any base is just a representation. The number itself is an abstract concept. We are so used to thinking in base_10 (decimal) that I think we can be forgiven if we start to blend the representation with the number itself. The number seven is represented as 7 in base_10 and 111 in base_2, but it is just seven in both cases. We humans have just given these symbols meaning in different contexts. I am not sure we can even visualize seven without a representation. Even something like |||||||, laying seven sticks on the ground, isn’t that just seven in base_1?
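A quick illustration of that in Python, where the number is one value and the representations are just different strings rendered from it:

n = 7                    # the abstract seven, held as a Python int
print(format(n, 'd'))    # "7"   -> its base_10 representation
print(format(n, 'b'))    # "111" -> its base_2 representation
print('|' * n)           # "|||||||" -> the sticks-on-the-ground version
print(int('111', 2) == int('7', 10))   # True: different strings, same number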

The algorithm for deriving a number from its representation in whatever base is very intuitive to me, for whatever reason.

n = d_k * b^k + d_(k - 1) * b^(k - 1) + ... + d_1 * b^1 + d_0 * b^0

Number n with digits d_* representing it in base b.
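Spelled out as a minimal Python sketch (digits_to_number is my made-up name; digits arrive most significant first, and the arithmetic happens in whatever Python ints natively are, which foreshadows the next question):

def digits_to_number(digits, base):
    # Evaluates d_k * b^k + ... + d_1 * b^1 + d_0 * b^0.
    # Folding left to right like this is Horner's method: n = n * b + d.
    n = 0
    for d in digits:
        assert 0 <= d < base, "each digit must be in [0, base)"
        n = n * base + d
    return n

print(digits_to_number([1, 1, 1], 2))   # 7
print(digits_to_number([7], 10))        # 7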

The weird part though is…what base is n in then? It feels like a meta question about the function. And the answer is kinda meta: whatever base you do the arithmetic in. Since we are so used to arithmetic in base_10, that is usually what shows up here.

What if we want to find the representation of a number in a non-base_10 base, but we do not want to do the arithmetic in that non-base_10 base? The function above makes sense to me because it is easy to see that a digit further left is in a “higher” bucket and needs to be given its appropriate value to derive the number. If we instead already have a number and want to find the digits given a base, another way to say that is we have to find the amounts per bucket (the d_*).

Given the above equation it is pretty clear we need to divide by the base b in order to calculate the d_*. But how exactly? Refreshing on division real quick.

a = q * b + r, where 0 <= r < b  

The division algorithm: for a dividend a and divisor b, there is a quotient q and a remainder r.

Really spelling out that r constraint: if r = b, then a = q * b + b = (q + 1) * b. The remainder is less than b because otherwise the divisor could just “fit” one more time into the dividend.
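That (q, r) pair is exactly what Python’s built-in divmod returns:

a, b = 17, 5
q, r = divmod(a, b)      # q = 3, r = 2
assert a == q * b + r    # 17 = 3 * 5 + 2
assert 0 <= r < b        # if r hit 5, b would "fit" one more time

Now let’s line this up with the digits function from the top.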

n = d_k * b^k + d_(k - 1) * b^(k - 1) + ... + d_1 * b^1 + d_0 * b^0
  = d_k * b^k + r, where 0 <= r < b^k

The remainder method.

That second half of the equation makes it clear that the base value b^k is the divisor and the quotient d_k is the digit for that “bucket”. The remaining value, r = n - (d_k * b^k), can then be divvied up into the remaining, smaller buckets (e.g. b^(k - 1) and beyond). In practice this is just long division worked from the bottom: repeatedly divide by b, collect the remainders (which are the digits from d_0 up), and then reverse them.
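Here is that divide-and-collect loop as the inverse of digits_to_number above (number_to_digits is another made-up name; a sketch assuming n >= 0 and base >= 2):

def number_to_digits(n, base):
    # Each divmod peels off the lowest bucket: n = q * base + d_0.
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, d = divmod(n, base)   # remainder is the current lowest digit
        digits.append(d)
    digits.reverse()             # collected little end first, so flip
    return digits

print(number_to_digits(7, 2))    # [1, 1, 1]
print(number_to_digits(7, 10))   # [7]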

Since we humans are so used to doing arithmetic in base_10, I think it is pretty common when converting bases to use base_10 as an intermediary, even though it is not technically necessary. I am wondering, though, if we should care about that extra step in computer-land, where sometimes every cycle counts.

Computer programs have two common data types: strings and numbers. Strings are more general: just a series of symbols. They can hold any data, including a number represented in any base (e.g. “1010” or “9”)! The computer isn’t aware of the context; it is up to the program to give it meaning. Number data types are more specific, designed to hold just numbers and do number things (e.g. arithmetic). Since number types are more constrained, their storage and operations can be optimized, so they usually offer better performance. The computer can be more helpful.

So a string type can hold any base representation of a number, whereas the number type is the computer’s representation of a number. The computer is probably using base_2 under the hood somewhere, since all things are digital. But does this make things simpler to think about? If a program is operating on input in a rarely used base (let’s say base_58) and it wants to do any sort of number things with it, step one will be to get it into the computer’s number representation, so it can take advantage of all the things computers are good at. There is no way around the weird base -> computer base -> weird base dance (assuming output is in the weird base).
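To make the dance concrete: Python’s int(s, base) only goes up to base 36, so base_58 needs its own symbol table. A sketch, assuming the commonly used Bitcoin Base58 alphabet and ignoring details like leading-zero handling:

# One widely used Base58 alphabet (Bitcoin's); any 58-symbol string
# works, as long as both ends of the dance agree on it.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_to_int(s):
    # weird base -> computer base: fold the symbols into a Python int
    n = 0
    for ch in s:
        n = n * 58 + ALPHABET.index(ch)
    return n

def int_to_base58(n):
    # computer base -> weird base: repeated divmod, reversed at the end
    if n == 0:
        return ALPHABET[0]
    out = []
    while n > 0:
        n, d = divmod(n, 58)
        out.append(ALPHABET[d])
    return "".join(reversed(out))

assert int_to_base58(base58_to_int("abc")) == "abc"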

Another use case for a weird base is to store arbitrary data as a string, since string symbols are a common interface between systems. In this case, a program wants to get the weird base into base_2 so it can parse the data into a format it knows. Lucky for the program, base_2 is also what the computer is using for numbers, so parsing it into a number should be one and the same step. Just remember whether it’s stored big endian or little endian, since that will flip the byte order!
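A last sketch of that endian gotcha, using Python’s int.from_bytes, which makes you declare the byte order up front:

data = bytes([0x01, 0x02])             # two bytes off the wire
print(int.from_bytes(data, "big"))     # 258 (0x0102)
print(int.from_bytes(data, "little"))  # 513 (0x0201)
# Same bytes, two different numbers: byte order is part of the contract.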