Basics: The Definition of Endianness
Wikipedia defines endianness as "the ordering of components (such as bytes) of a data item (such as an integer), as stored in memory or sent on a serial connection." When a software professional talks about endianness, he is referring to some structured binary data which may be interpreted differently depending on how the individual bytes are ordered, just as date notation has a different meaning depending on whether you expect 01-12 to be the 1st of December or the 12th of January. This is the kind of thing most people don't normally think about until something unexpectedly goes wrong - in life and in software.
Whenever an Intel x86 processor reads a 4-byte unsigned integer, it expects the first byte to hold the number modulo 2^8, the second byte to hold the number divided by 2^8 (rounded down), again modulo 2^8, and so on. For example:
A0 FF 7D 14 = A0(16) + 256(10) * FF(16) + 65536(10) * 7D(16) + 16777216(10) * 14(16) = 343,801,760(10)
A big-endian processor expects the opposite - there, the same byte sequence would signify the number 2,701,098,260. On the other hand, if the same bytes were read as two adjacent 2-byte unsigned integers, the result would be A0 + 256*FF, 7D + 256*14 on a Pentium and 256*A0 + FF, 256*7D + 14 on a big-endian machine such as a PlayStation 3. The specific endiannesses of the various architectures and the reasons for the different conventions are well described elsewhere, so we will not dwell on them here. Note that in the example above, if the data were read as 4 adjacent single bytes, the result would be the same regardless of the CPU.
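The arithmetic above can be sketched in code. This is only an illustration of the two conventions, not a description of what the hardware does internally; the class and method names (ByteOrderDemo, ReadUInt32LE and so on) are made up for this example.

```csharp
using System;

// Hypothetical helper for illustration; these names do not come from any library.
public static class ByteOrderDemo
{
    // Little-endian: the first byte is the least significant one.
    public static uint ReadUInt32LE(byte[] b, int offset = 0)
    {
        return (uint)b[offset]
             | ((uint)b[offset + 1] << 8)
             | ((uint)b[offset + 2] << 16)
             | ((uint)b[offset + 3] << 24);
    }

    // Big-endian: the first byte is the most significant one.
    public static uint ReadUInt32BE(byte[] b, int offset = 0)
    {
        return ((uint)b[offset] << 24)
             | ((uint)b[offset + 1] << 16)
             | ((uint)b[offset + 2] << 8)
             | (uint)b[offset + 3];
    }

    // The same two conventions for 2-byte unsigned integers.
    public static ushort ReadUInt16LE(byte[] b, int offset = 0)
    {
        return (ushort)(b[offset] | (b[offset + 1] << 8));
    }

    public static ushort ReadUInt16BE(byte[] b, int offset = 0)
    {
        return (ushort)((b[offset] << 8) | b[offset + 1]);
    }
}
```

For the bytes A0 FF 7D 14, ReadUInt32LE gives 343,801,760 and ReadUInt32BE gives 2,701,098,260. Read as two 2-byte integers, the little-endian pair is 65,440 and 5,245, while the big-endian pair is 41,215 and 32,020.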
Standard networking protocols use big-endian notation for integer fields such as port numbers, while the data itself has no predefined endianness other than the one assumed by the application. Various data formats and devices have conventions about how numeric data (i.e., all data) is to be stored, which occasionally leads to painful complications for unsuspecting software engineers.
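As a rough sketch of what "big-endian on the wire" means for a port number, here is one way to write and read a 2-byte field by hand. PortCodec and its methods are hypothetical names invented for this example; in real code you would more likely reach for something like IPAddress.HostToNetworkOrder from System.Net.

```csharp
using System;

// Hypothetical helper for illustration; these names do not come from any library.
public static class PortCodec
{
    // Writes a port number in network (big-endian) order: high byte first.
    public static byte[] Encode(ushort port)
    {
        return new byte[] { (byte)(port >> 8), (byte)(port & 0xFF) };
    }

    // Reads a port number from two bytes in network order.
    public static ushort Decode(byte[] wire)
    {
        return (ushort)((wire[0] << 8) | wire[1]);
    }
}
```

Because the shifts operate on the numeric value rather than on memory, PortCodec.Encode(8080) yields the bytes 1F 90 on any host, and Decode turns them back into 8080.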
Insight: The Meaning of Endianness
In its essence, "endianness" is the way in which the CPU interprets binary sequences in memory. The processor has no concept of "number", unlike humans who associate numbers with intuition and experiences. However, if we design a bijective correspondence between numbers (which are interesting to us) and binary strings (which the CPU can handle), we can use the processor to manipulate the binary strings according to its instruction set, and then we can decode the results back into abstract numbers - or have IO devices help us by representing the bytes in some other fashion, e.g. as characters on a screen.
In this sense, the endianness of a CPU is the way it handles strings of bytes with respect to the basic arithmetic operations - and some basic primitives such as comparison, jumping and memory addressing. Everything else (looping and branching, array indexing, functions, input and output) is built on top of that and operates correctly so long as the CPU's operations honor the bijection.
Basically, endianness is an encoding between (abstract) numbers and (concrete) binary strings in the same way that UCS is an encoding between (abstract) characters and (abstract) numbers and UTF-8 is an encoding between (abstract) numbers and (concrete) sequences of bytes.
This is why, just like a piece of text data is meaningless unless we know its encoding and language, binary data is meaningless without knowing the endianness with which it was encoded - even if we know that such and such bytes represent a 4-byte signed integer, we can't tell if it's the number 2 or 33,554,432.
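To make the ambiguity concrete, here is a small sketch that encodes a number with one convention and decodes it with both. AmbiguityDemo is a hypothetical name; BinaryPrimitives is the real .NET type from System.Buffers.Binary, and the sketch assumes a runtime that ships it (.NET Core 2.1 or later, or the System.Memory package).

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration; only BinaryPrimitives is a library type.
public static class AmbiguityDemo
{
    // Encodes a signed 32-bit value as 4 little-endian bytes.
    public static byte[] EncodeLE(int value)
    {
        var bytes = new byte[4];
        BinaryPrimitives.WriteInt32LittleEndian(bytes, value);
        return bytes;
    }

    // Decodes 4 bytes assuming little-endian order.
    public static int DecodeLE(byte[] bytes) => BinaryPrimitives.ReadInt32LittleEndian(bytes);

    // Decodes the same 4 bytes assuming big-endian order.
    public static int DecodeBE(byte[] bytes) => BinaryPrimitives.ReadInt32BigEndian(bytes);
}
```

EncodeLE(2) produces the bytes 02 00 00 00. DecodeLE reads them back as 2, while DecodeBE reads the very same bytes as 33,554,432.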
Practical Issues: When Does Endianness Matter?
Short answer: when doing IO. In the case of user input/output, it's not much of a concern, since your OS and device drivers will transparently take care of the matter and make sure you get the correct byte strings in memory. Regardless of the technology stack you're working in, a correct program operating on properly represented input data will produce correct, properly represented output. No one (except for the OS) needs to know how the CPU does arithmetic on binary strings. Once you have a properly constructed data object, the logic will work.
However, when dealing with files and networks, if the data is in an endianness-sensitive format and originated on another computer with different or unknown endianness, you need to take steps to ensure it's properly interpreted at your end. Just as for security you need to watch your input and sanitize possibly tainted data, for correctness you need to keep track of which data may have been encoded on a device with different endianness, and process it accordingly before feeding it into the business logic of your application. You should also be aware when some supposedly helpful intermediate component does the conversion for you, so you won't corrupt your data by processing it again.
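Here is a minimal sketch of that kind of boundary conversion, assuming a made-up file format that stores a 4-byte signed integer in big-endian order. FieldReader is a hypothetical name; BitConverter and Array.Reverse are standard .NET.

```csharp
using System;

// Hypothetical helper for illustration; the big-endian file format is an assumption.
public static class FieldReader
{
    // Reads a 32-bit signed integer stored big-endian at the given offset.
    public static int ReadInt32BigEndian(byte[] buffer, int offset)
    {
        // Work on a copy so the caller's buffer is never modified.
        var field = new byte[4];
        Array.Copy(buffer, offset, field, 0, 4);

        // BitConverter uses the host's byte order, so flip only on little-endian hosts.
        if (BitConverter.IsLittleEndian)
            Array.Reverse(field);

        return BitConverter.ToInt32(field, 0);
    }
}
```

The conversion happens exactly once, at the point where the bytes enter the program. Reversing the bytes a second time would restore the original order, which is precisely the corruption you get when an intermediate component has already converted the data and you convert it again.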
The other case where you need to keep endianness in mind is when, for some reason, you're reading raw memory. Here's a tiny C# program for testing your processor's byte endianness:
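    using System;

    // Alternatively, just read BitConverter.IsLittleEndian.
    // Compile with unsafe code enabled (e.g. csc /unsafe).
    class Program
    {
        unsafe static void Main()
        {
            int num = 1;
            // Inspect the byte stored at the lowest address of num.
            bool le = *(byte*)&num == 1;
            Console.WriteLine("{0} endian", le ? "LITTLE" : "BIG");
        }
    }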
I plan to give some examples of endianness-related issues in practice a bit later.
As a footnote, UTF-16 "Big Endian" and UTF-16 "Little Endian" have nothing to do with CPU endianness - they're two number-to-bytestring encodings. The fact that the bytestrings are related, so that a wrongly assumed encoding and a different CPU endianness cancel each other out, is in this sense almost a coincidence - so stop being confused.
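A quick sketch of the two encodings side by side, using the standard System.Text.Encoding.Unicode (UTF-16 LE) and Encoding.BigEndianUnicode (UTF-16 BE) properties. Utf16Demo is a hypothetical wrapper name.

```csharp
using System;
using System.Text;

// Hypothetical wrapper for illustration; the Encoding properties are standard .NET.
public static class Utf16Demo
{
    // UTF-16 LE: each 16-bit code unit is written low byte first.
    public static byte[] EncodeLE(string s) => Encoding.Unicode.GetBytes(s);

    // UTF-16 BE: each 16-bit code unit is written high byte first.
    public static byte[] EncodeBE(string s) => Encoding.BigEndianUnicode.GetBytes(s);
}
```

EncodeLE("A") gives 41 00 and EncodeBE("A") gives 00 41, and both results are the same on every machine. The byte order is a property of the chosen encoding, not of the CPU that runs the code.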
Edit history:
2013-03-14 - I'd wrongly stated that AMD processors tend to be big-endian. I now know that all x86/x64 CPUs are little-endian, as are most modern architectures - in addition to bi-endian ones (sic) such as SPARC, ARM and PowerPC. Also corrected misleading statement about data endianness in network transmissions.