Compiling with Types, or going back to the stone-age w/ object (or pointers in sheep’s clothing)

I never took compiler design in college. I quite frankly didn’t think it very important. Why would I need to write a compiler? Well, maybe in today’s world where data is sent in Text encoding, compiler ideas might help a little.

What if you have no idea what data type, you are trying to receive? If you’ve done a first year programming class, you probably remember:

int* a = 1;

This is a typed pointer. This means the variable a contains a number (unsigned 32bit or 64bit, depending on the OS type) that corresponds to a memory address in RAM. At that location is, (assuming the language specification says a int is 32bits), 4 contiguous bytes corresponding to a 32bit integer (probably in 2’s complement representation and, in the endian-ness of that hardware platform).

Endian-ness is important b/c the adder expects the digits to be in a specific order.

We see 123+456:

                   <-----
      Carry-over   111111
 123   ‭          000001111011‬
+456   ‭          000111001000‬
----
 579   ‭          001001000011‬

But storage in a computer is arbitrary, so the add circuit might expect the bits (computer digits) in the 32bit register to be ordered "in reverse". Here the digits are reverse, and the least significant digits on the left side (still unchanged from above, the 3, 6, and 9), and hand addition is done in reverse:

                     ------>
      Carry-over     1111111
 321              110111100000 
+654              000100111000 
----
 975              110000100100 

but when you read it to make sense to you as a human, you have to reverse the digits again. The computer when displaying to you translates the value to text "579", and displays it to you.

A register in the CPU is the place where the CPU puts a number. To add a number, the CPU needs at least 2 registers. To add (2x) 32 bit, signed numbers, it needs a add circuit, that can take the digits of (2x) 2's complement signed number, and produces digits into another 2's complement signed number. When the CPU is instructed to add 2 numbers, it is told the circuit to use (instruction), which registers contain the operand values, it connects the right registers to the add circuit, and outputs it on the other side.

            (0)     +---------+
[000123]----(1)---->| Add     |----(1)--->[000789]
[000456]----(2)---->| Circuit |
[??????]----(3)     +---------+

It's supposed to put result sum back into the first register. No, I do not know how the add circuit replaces one of it's operand registers with the result value (A:=A+B, 
add eax,ebx).  But I think that's the case.  I don't think it important to solve that mystery, to understand the idea of endian-ness as it pertains to binary addition.

      +-------------+
0 --->|             |--->0
0 --->|             |--->0
0 --->|             |--->1
0 --->|             |--->0
0 --->|             |--->0
1 --->|             |--->1
1 --->|             |--->0
1 --->|             |--->0
1 --->|             |--->0
0 --->|             |--->0
1 --->|             |--->1
1‬ --->|             |--->1
      | Add circuit |
0 --->|             |
0 --->|             |
0 --->|             |
1 --->|             |
1 --->|             |
1 --->|             |
0 --->|             |
0 --->|             |
1 --->|             |
0 --->|             |
0 --->|             |
0‬ --->|             |
      +-------------+

Each zero and one is a digit in binary. Physically, that zero and one is either 0v or (as an example) 1.1v, respectively. The computer does with transistors, just as we learn in school to add the least significant digit, and carry over if it goes over 10, computer circuits do this in binary. Inside this add circuit are logic gates, to reproduce addition as we know it, in binary. And 32bit addition, has 64 electrical lines going into the Add circuit, and 32 lines going out, to a result register. I've assumed that this can all be done in one cycle, or the amount of time it takes for the electricity to flow completely from one side to another (analogously open gates of water and waiting for the water to rise to match the water level, or close it and connect to drain to drop to zero).

The carrying over process has a direction (shown by the arrow), that always goes from least significant digit, to next significant. The circuit expects an endian-ness (or analogously reading left to right, or right to left).

Typed pointers are a construction of compiler design. There is no such thing to a computer CPU. A pointer to a int, might as well be a pointer to a null-terminated string. The type is there, for the compiler to help to prevent you (the programmer) from making mistakes. If you say you intended to use a 32bit integer (let's say analogously 4digits), and later in the program, try to write it to memory in a place that intended to hold 16bit integers (let's say analogously these hold 2 digits each), you're probably going to be overwriting 2 numbers (writing 4 digits of 0012, over 2 numbers 13,14). The type is for the compiler to tell you "You messed up. You told me you were sticking a 32bit number, and now you're trying to put it in a 16bit number".

Typed objects (structs) are the same thing. It's just packed data types, saying first 4 bytes are Length, next 4 bytes are the special values used, next 8 is the location of the variable length value.

struct selfmadestring {
   uint32 Length; //unsigned integer in C# is 4 bytes
   unicodechar[2] ControlChar; //Unicode character used to separate values.  Unicode is 2bytes, so 4 bytes, can hold 2 char.
   char* string; //an memory address in RAM, that is contiguous (supposedly according to code, as long as [.Length above] bytes long)
}

4+4+8 is 16bytes.

If we declare a variable of this custom written data-type selfmadestring* text, saying to the compiler, to expect the variable text to contain a memory address, pointing to a location with 16 contiguous bytes (which I will colloquially call a bytepack). The type selfmadestring says it contains 3 different values. If you instructed the compiler to write a "selfmadestring" there, it should contain a length, control chars, and a memory address to another string. BUT it CAN contain ANYTHING. In fact, a lot of languages will allow you to "change types" for a memory address. Sometimes, it's convenient to do so. If johnnystring* text2 has the exact definition as "selfmadestring", some compilers (not C# bc the type information is part of the byte pack) will allow you to assign that memory address to a different type text2=text, if you make some assurance to the compiler that you know what you are doing. The memory address that the variable "text" held, is now assigned to a variable text2 and told by compiler to interpret as a "johnnystring". This is a what programmers intend to do, when they "change types". Of course, programmers make mistakes. It's entirely possible that johnnystring* text2 has an completely different definition, such as the first 8 bytes are another memory address, in which case, any code that extracts the length to be in the first 4 bytes, is going to try to read it as a length with unpredictable results, but definitely wrong. So when you make this "assurance", the compiler will defer to your understanding, even if you are making a mistake (Again, not C#, bc the types are part of the bytepack, the runtime will not allow you to do this, but other less "error checking built in" language technologies will).

So, what if you want to be able to move any "compatible" data type, into another "compatible" datatype? With higher level languages, we tend to forget a lot about the bit-ness of a number. Compilers such as C#, will allow you to assign a int (32bit signed), into a long(64bit signed), long l = (int)5;. This is bc the compiler does what I've heard others call called "syntatical sugar" by a compiler. It will insert a call to a converter for you, long l = ExampleConvert.ToLong((int)5);, bc the compiler knows what you want to do, and it doesn't bother telling you the obvious. Why does it need to do this? b/c they use different amount of digits. For a computer -0009, isn't the same as -00000009. I'd show you what -9 looks like in binary, but I don't really remember exactly what 2's complement is. Just that it makes binary arithmetic easier for a circuit.

Overloading in a language, I suspect, is handled the same way. Overloading is naming a method (function) with the same name, but varying the number of arguments, order, and types. The compiler knows what you meant to do. It just doesn't bother telling you anymore (that the functions are actually different to it).
So if you had 3 methods declared with same name, but different (unambiguous) argument types

static string ConvertToString(int item)
{
     return "Number is " + item.ToString();
}
static string ConvertToString(double item)
{
     return "Deciaml is " + item.ToString();
}
static string ConvertToString(DateTime item)
{
     item.Millseconds=0;
     return "Date is " + item.ToString();
}

The compiler knows exactly which one you mean, when you write int x=1;Console.WriteLine(ConvertToString(x));. "x" is declared to the compiler as a int. An int in c# is a 32bit signed integer. Which means it has 31 binary digits, and one digit is the sign. So when it reads it, it reads 4 bytes and interprets it as a number.

But it's not all about size. A date in c# is also 32bits in size, but in c# a date is the number of milliseconds from (let's call it the beginning of computer time), so the computer needs to interpret the bits differently. It is entirely possible, if the c# program needs to represent dates billions and billions of years ago, that it cannot do it with the standard c# datetime. But I've never had a need to do this. And I doubt anyone else does, either.

So when the compiler sees int x=1;Console.WriteLine(ConvertToString(x));, it knows to plug in the method "address" for ConvertToString(int) (if you know basic, this is like a goto statement), not ConvertToString(DateTime).

The compiler does all of this for you. You don't even have to think about it.

But! What if we have no idea what the format(s) is, when we write the code? The solution using the compiler function overloading as your primary tool, is to write the vast majority of possible formats, that you as a programmer might use. That is overloading. But the compiler can only do, what it is told. And the compiler only knows what the programmer can teach it. What if the programmer doesn't know what datatype his program is going to be called with? And it cannot be checked by the compiler.

There are a other things we can do, to handle the same tasks for different datatypes. Below is a breakdown of different ways we can do this, in a simple program which we try to print something (C# already has a specification for this same functionality, called .ToString() which is analogously the last option we'll go over, but we're recreating the wheel here as a demonstration exercise). The first option is that we can hope that the programmer knows exactly what types it has to work with, and that the library has overloads for that exact type, and so the compiler can create the right "goto statement" to the right existing method during compilation. Meaning that goto address, never changes, bc the type that it encounters, never changes. Those are the times recorded for end1 to end4 variables in the program.

A method that you can use in .NET (C# is built on) bc all of the types are packed with information on what type it is. You can ask any variable .GetType() and it will tell you. We use the "is" operator for similar functionality. So we ask, are you a int, then use this function. That methodology is tested in end5, and end7 variables. end7 variable only uses operations on int type, so you can see apples to apples compared to end1, which is also only has operations on int types only. (Converting a date and a decimal to text seems to take a long time, comparatively)

The last way (the variable end6) is object oriented way of handling this, is to create a special class that knows how to handle itself. And create a special method for itself. In this case, it looks for a method in that type's method table, and then it calls it. Basically, the idea is that it always has 2 lookups (compared to a non-changing goto address). Above in previous example, it can be a long chain of if-else-if. The end8 variable is only running int datatype, so you can do apples to apples comparison of the performance of end1, end7, and end8, to see what the difference is, since they all operate on integers, the difference should be the methodology of handling different types.

class Program
    {
        static void Main(string[] args)
        {
            int COUNT = 1000000;

            Console.Write(".");
            var start1 = Environment.TickCount;
            for (int i = 0; i < COUNT; i++)
            {
                object value = i;
                var text = ConvertToString((int)value);
            }
            var end1 = Environment.TickCount - start1;

            Console.Write(".");
            var start2 = Environment.TickCount;
            for (int i = 0; i < COUNT; i++)
            {
                object value = DateTime.Now;
                var text = ConvertToString((DateTime)value);
            }
            var end2 = Environment.TickCount - start2;

            Console.Write(".");
            var start3 = Environment.TickCount;
            for (int i = 0; i < COUNT; i++)
            {
                object value = (double)i / Environment.TickCount;
                var text = ConvertToString((double)value);
            }
            var end3 = Environment.TickCount - start3;

            Console.Write(".");
            var start4 = Environment.TickCount;
            for (int i = 0; i < COUNT; i++)
            {
                object value = (decimal)i / Environment.TickCount;
                var text = ConvertToString((decimal)value);
            }
            var end4 = Environment.TickCount - start4;

            Console.Write(".");
            var start5 = Environment.TickCount;
            Random rnd = new Random();
            for (int i = 0; i < COUNT; i++)
            {
                object value = null;
                var type = rnd.Next(4);
                if(type==0)
                    value = i;
                else if (type == 1)
                    value = DateTime.Now;
                else if (type == 2)
                    value = (double)i / Environment.TickCount;
                else if (type == 3)
                    value = (decimal)i / Environment.TickCount;

                var text = FindConverterAndToString(value);
            }
            var end5 = Environment.TickCount - start5;


            Console.Write(".");
            var start6 = Environment.TickCount;
            for (int i = 0; i < COUNT; i++)
            {
                A value = null;
                var type = rnd.Next(2);
                if (type == 0)
                    value = new B() { Number=rnd.Next() };
                else if (type == 1)
                    value = new C() { Date = DateTime.Now };

                var text = value.ToConsoleString();
            }
            var end6 = Environment.TickCount - start6;


            Console.Write(".");
            var start7 = Environment.TickCount;
            for (int i = 0; i < COUNT; i++)
            {
                object value = i;
                var text = FindConverterAndToString(value);
            }
            var end7 = Environment.TickCount - start7;


            Console.Write(".");
            var start8 = Environment.TickCount;
            for (int i = 0; i < COUNT; i++)
            {
                A value = new B() { Number = rnd.Next() };

                var text = value.ToConsoleString();
            }
            var end8 = Environment.TickCount - start8;



            Console.WriteLine();
            Console.WriteLine("Compiler branched to f(int)..................................{0} ms", end1);
            Console.WriteLine("Compiler branched to f(DateTime).............................{0} ms", end2);
            Console.WriteLine("Compiler branched to f(double)...............................{0} ms", end3);
            Console.WriteLine("Compiler branched to f(decimal)..............................{0} ms", end4);
            Console.WriteLine("Compiler branched to f(object) --> f(type)(unbox(object))....{0} ms", end5);
            Console.WriteLine("Compiler branched to get(f) --> DateTime/int.f().............{0} ms", end6);

            Console.WriteLine("Compiler branched to f(boxed(int)) --> f(int)(unbox(int))....{0} ms", end7);
            Console.WriteLine("Compiler branched to get(f) --> int.f()......................{0} ms", end8);

            Console.ReadLine();
        }

        static string ConvertToString(int item)
        {
            return (item+1).ToString();
        }
        static string ConvertToString(Guid item)
        {
            return item.ToString();
        }
        static string ConvertToString(double item)
        {
            return (item+1d).ToString();
        }
        static string ConvertToString(decimal item)
        {
            return (item+1m).ToString();
        }
        static string ConvertToString(DateTime item)
        {
            return (item.AddSeconds(1)).ToString();
        }

        delegate string converter(object item);
        static string FindConverterAndToString(object item)
        {
            converter Tostring = null;
            if (item is int)
                Tostring = IntToString;
            else if (item is Guid)
                Tostring = GuidToString;
            else if (item is double)
                Tostring = DoubleToString;
            else if (item is decimal)
                Tostring = DecimalToString;
            else if (item is DateTime)
                Tostring = DateTimeToString;
            else
                throw new NotSupportedException("Never expected to create a string from " + item.GetType().Name);

            return Tostring(item);
        }
        static string IntToString(object item)
        {
            return (((int)item)+1).ToString();
        }
        static string GuidToString(object item)
        {
            return ((Guid)item).ToString();
        }
        static string DoubleToString(object item)
        {
            return (((double)item)+1d).ToString();
        }
        static string DecimalToString(object item)
        {
            return (((decimal)item)+1m).ToString();
        }
        static string DateTimeToString(object item)
        {
            return (((DateTime)item).AddSeconds(1)).ToString();
        }

        
    }

    abstract class A
    {
        public abstract string ToConsoleString();
    }
    class B : A
    {
        public int Number;
        public override string ToConsoleString()
        {
            return (this.Number + 1).ToString();
        }
    }
    class C : A
    {
        public DateTime Date;
        public override string ToConsoleString()
        {
            return (this.Date.AddSeconds(1)).ToString();
        }
    }

The above program produces on a Intel i5-4308U (Remember, it's not the actual numbers, but relative performance to each other, that we are looking at):

Compiler branched to f(int)..................................594 ms
Compiler branched to f(DateTime).............................7266 ms
Compiler branched to f(double)...............................1093 ms
Compiler branched to f(decimal)..............................2735 ms
Compiler branched to f(object) --> f(type)(unbox(object))....3219 ms
Compiler branched to get(f) --> DateTime/int.f().............4140 ms
Compiler branched to f(boxed(int)) --> f(int)(unbox(int))....656 ms
Compiler branched to get(f) --> int.f()......................782 ms

As you can see, the performance, is kinda in line with what we are expecting. The analogy is:

With end1 running faster than end7 and end8. end1 basically has a goto statement with a line number that never changes.

end7 has to get a goto statement's line number from a if-else-if.

And end8 has a "relative goto number". It has to goto the class instance address first. then takes the "relative goto number" from there to find the actual goto number. And then it "goes to".

But you'll notice, there is very little difference. But if you're running repetitive actions to the tune of billions of iterations. end7 is 10% longer than end1. end8 is 31% longer than end1. So if you're processing pixels, you probably know which one is better. But if you're just processing a few thousand, you may never see the difference paid back, compared to how many lines of code you have to write to support different formats.

So we are going back to goto statements, which are abhorred by programming teachers, but in a way that is organized where the compiler can help tell you "I think you made a mistake here".

Leave a Reply

Your email address will not be published. Required fields are marked *