Byte

From Citizendium
Revision as of 20:43, 23 April 2007 by imported>Robert Tito
The Hexer hex editor displaying the Linux kernel version 2.6.20.6; this image illustrates the value of bytes composing a program as they appear in the hexadecimal format

In computer science, a byte is a unit of data consisting of eight binary digits, each of which is called a bit. The 8-bit byte is the smallest addressable unit of information in the instruction set architecture (ISA) of most electronic computers today.

When grouped together, bytes can contain the information to form a document, such as a photograph or a book. All information stored on a computer is composed of bytes, from e-mails and pictures to programs and data stored on a hard drive. Although the byte may at first appear to be a simple concept, its precise definition has varied between computer architectures and over time.

A byte is really just an 8-digit binary integer. The semantics (meaning assigned) for a given byte is a matter defined within the instruction set architecture (ISA) of each type of computer. Many different encodings have been tried for text, for integers, and for various non-integer numbers. A discussion of the merits of the various representations is complex and falls under the field of computer architecture.
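As a rough illustration of how the meaning of a byte depends on the interpretation applied to it, the Python sketch below reads the same eight bits in three ways. The choice of the value 0xFF and of the Latin-1 character set is an assumption made for the example; it is not tied to any particular ISA.

```python
import struct

raw = bytes([0xFF])  # a single byte with all eight bits set

# The same bit pattern under three different interpretations.
as_unsigned = struct.unpack("B", raw)[0]  # unsigned 8-bit integer
as_signed = struct.unpack("b", raw)[0]    # signed 8-bit integer (two's complement)
as_latin1 = raw.decode("latin-1")         # one character in ISO 8859-1 (Latin-1)

print(as_unsigned, as_signed, repr(as_latin1))
```

The bits never change; only the rule used to read them does.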

In the early history of computing, experimental computers were built with other byte sizes, including (for example) seven bits (binary digits) per byte.

Definition of byte

In electronics, information is represented by toggling between two states, usually referred to as 'on' and 'off'. To represent these states, computer scientists use the values 0 (off) and 1 (on); a single such value is called a bit. Half of a byte (four bits) is referred to as a nibble. A word is the standard number of bytes that a processor handles as a unit, and its size depends on the architecture. On word-addressed machines, memory can be addressed only in multiples of the word size. For example, a 16-bit processor has words consisting of two bytes (2 x 8 = 16), a 32-bit processor has words consisting of four bytes (4 x 8 = 32), and so on.

Each byte is made of eight bits, which can represent any number from 0 to 255. We obtain this number of possible values, which is 256 when including the 0, by raising the possible values of a bit (two) to the power of the length of a byte (eight); thus, 2^8 = 256 possible values in a byte.
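The arithmetic can be checked with a short Python sketch. The helper names `to_bits` and `nibbles` are invented for this illustration and are not standard functions.

```python
BITS_PER_BYTE = 8

value_count = 2 ** BITS_PER_BYTE  # number of distinct bit patterns in a byte
max_value = value_count - 1       # largest unsigned value a byte can hold


def to_bits(value: int) -> str:
    """Return the eight-digit binary form of a value in the range 0..255."""
    if not 0 <= value < value_count:
        raise ValueError("value does not fit in one byte")
    return format(value, "08b")


def nibbles(value: int) -> tuple[int, int]:
    """Split a byte into its high and low four-bit halves (nibbles)."""
    return value >> 4, value & 0x0F


print(value_count, max_value)  # 256 255
print(to_bits(5))              # 00000101
print(nibbles(0xA7))           # (10, 7)
```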

Bytes can represent a wide variety of data, from the characters in a string of text to the assembled and linked machine code of a binary executable file, the language in which programs tell the computer how to act. Every file, region of system memory, and network stream is composed of bytes.

Perhaps the oldest formation of bytes was plain text (without any punctuation) as used in telegrams. To compensate for the absence of basic punctuation, telegrams in times past would often use the word "STOP" to represent a period.

In computers, plain text came to mean a string, file, or byte array that is printable, consisting only of standard alphanumeric bytes and a few control bytes such as tab, carriage return, or line feed. Plain text was not supposed to include any bytes that a printer would not know how to handle. The actual value assigned to each character has varied over the years. Today, however, we have the American Standard Code for Information Interchange (ASCII), which allows data to remain readable when transmitted between different media, such as from one operating system to another. For instance, a user who typed a plain text document in Linux would usually be able to view or print the same file on a Macintosh computer. One example of ASCII is the capital letters of the English alphabet, which range from 65 for "A" to 90 for "Z" (101 to 132 in octal).
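A brief Python sketch of these ideas follows. The function `is_plain_text` is a hypothetical helper written for this example; it treats printable ASCII plus tab, line feed, and carriage return as plain text, which is one reasonable reading of the description above rather than a formal definition.

```python
# Code values of the capital letters, in decimal and octal.
print(ord("A"), ord("Z"))            # 65 90
print(oct(ord("A")), oct(ord("Z")))  # 0o101 0o132

# Encoding a string as ASCII yields one byte per character.
encoded = "BYTE".encode("ascii")
codes = list(encoded)
print(codes)                         # [66, 89, 84, 69]

# Control bytes allowed in plain text: tab, line feed, carriage return.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def is_plain_text(data: bytes) -> bool:
    """Return True if every byte is printable ASCII or an allowed control byte."""
    return all(0x20 <= b <= 0x7E or b in ALLOWED_CONTROLS for b in data)


print(is_plain_text(b"Hello, world\r\n"))  # True
print(is_plain_text(b"\x00\x01binary"))    # False
```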

Endianness

When multiple contiguous bytes represent a single number, there are two possible opposite "orderings" of the bytes; the particular ordering used is called endianness. Just as some human languages are written from left to right, such as English, while others are written from right to left, such as Hebrew, bytes can be arranged "big end first" (with the most significant byte at the lower memory address) or "little end first" (with the least significant byte at the lower memory address). (Compare +1000 and 1000_+ to denote 1000 in big endian way and little endian way, respectively.)

Suppose we are writing a program that uses the number 1024, stored in two bytes as the hexadecimal value 0x0400. Here 0x04 is the most significant byte. If this byte is written first, that is, at the lowest memory address, then we are using 'big-endian' ordering. If it is written last, at the highest memory address, then we are using 'little-endian' ordering. Interestingly, these names are derived from the book Gulliver's Travels, in which the Lilliputians' foremost political concern was whether eggs should be opened from the little end or the big end. The story parallels the modern concept in that the choice of one ordering over the other is, in some cases, based not on technical merit but on politics.[1]
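The two orderings of the number 1024 can be made visible in Python with the built-in `int.to_bytes` and `int.from_bytes` methods. The two-byte width is an assumption chosen for the example.

```python
number = 1024  # 0x0400 when written as two bytes

big = number.to_bytes(2, byteorder="big")        # most significant byte first
little = number.to_bytes(2, byteorder="little")  # least significant byte first

print(big.hex())     # 0400
print(little.hex())  # 0004

# Reading the bytes back with the wrong ordering yields a different number.
right = int.from_bytes(big, byteorder="big")
wrong = int.from_bytes(big, byteorder="little")
print(right, wrong)  # 1024 4
```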

This is typically not a problem when dealing with local system memory, since the endianness is determined by the processor's architecture. It can pose a problem, however, when data moves between systems, as in network streams. For this reason, the byte order must be agreed upon before any data is sent; the Internet protocols, for example, specify big-endian order, often called network byte order. This ensures that the information is read correctly at the receiving end.
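As a minimal sketch of how a program might fix the byte order explicitly before sending data, Python's standard `struct` module accepts a `!` prefix meaning network (big-endian) order. The 16-bit field width is an assumption made for the example.

```python
import struct
import sys

# '!' selects network byte order (big-endian); 'H' is an unsigned 16-bit integer.
payload = struct.pack("!H", 1024)
print(payload.hex())  # 0400, regardless of the sending machine's own byte order

# The receiver unpacks with the same format string and recovers the value.
received = struct.unpack("!H", payload)[0]
print(received)       # 1024

# The byte order of the local processor, for comparison: 'little' or 'big'.
print(sys.byteorder)
```

Because both ends name the ordering in the format string, the result does not depend on either machine's native endianness.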

Word origin and ambiguity

The word 'byte' is generally believed to have been coined by Dr. Werner Buchholz of IBM in 1956. It is a play on the word 'bit', and originally referred to the number of bits used to represent a character.[2] This number is usually eight, but in some cases (especially in times past), it can be any number ranging from as few as 2 to as many as 128 bits. Thus, the word 'byte' is actually an ambiguous term. For this reason, an eight-bit byte is sometimes referred to as an 'octet'.[3]

Sub-units

Because files are normally many thousands or even billions of times larger than a byte, other terms designating larger byte quantities are used to increase readability. Metric prefixes are added to the word byte, such as kilo for one thousand bytes (kilobyte), mega for one million (megabyte), giga for one billion (gigabyte), and tera for one trillion (terabyte). One thousand gigabytes compose a terabyte, and even the largest consumer hard drives today hold only three-fourths of a terabyte (750 'gigs', or gigabytes). The rapid pace of technological advancement may make the terabyte commonplace in the future, however.
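The following Python sketch shows one way a program might apply these metric prefixes to a raw byte count. The function name `format_size` and the unit list are assumptions for illustration; the sketch uses the decimal (base 10) meaning of each prefix.

```python
UNITS = ["B", "KB", "MB", "GB", "TB"]


def format_size(num_bytes: int) -> str:
    """Express a byte count using the largest metric prefix that fits."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that keeps the number below 1000,
        # or at the largest unit in the list.
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:g} {unit}"
        size /= 1000


print(format_size(999))            # 999 B
print(format_size(1500))           # 1.5 KB
print(format_size(750 * 10 ** 9))  # 750 GB
print(format_size(10 ** 12))       # 1 TB
```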

Conflicting definitions

For more information, see: Binary prefix.

Traditionally, the computer world has often used a value of 1024 instead of 1000 when referring to a kilobyte. This was done because programmers needed a number compatible with base 2, and 1024 is equal to 2 to the 10th power. Typically, storage space is measured in base 2, whereas data rates generally use base 10. Thus, engineers in different fields of computer science may use the same term when referring to different units of measurement (numbers of bytes).

Because of the widespread confusion between these two meanings, the International Electrotechnical Commission (IEC) has made an effort to remedy the problem. It has standardized a new system called the 'binary prefix', which replaces the word 'kilobyte' with 'kibibyte', abbreviated as KiB. This solution has since been approved by the IEEE on a trial-use basis, and may one day become a true standard.[4]

While the difference between 1000 and 1024 may seem trivial, the discrepancy grows with the size of the disk. The difference between 1 TB and 1 TiB, for instance, is approximately 10%. As hard drives become larger, the need for a distinction between these two prefixes will grow. This has been a problem for hard disk drive manufacturers in particular. For example, one well-known disk manufacturer, Western Digital, has recently been taken to court for its use of base 10 when labeling the capacity of its drives. Labeling a hard drive's capacity in base 10 suggests a greater storage capacity to a consumer who assumes the figure refers to base 2.[5]
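A short Python sketch makes the discrepancy concrete. It assumes a drive labeled in base 10 and converts the labeled capacity to binary units; the function name `to_binary_units` is invented for this example.

```python
GIB = 2 ** 30  # one gibibyte in bytes
TIB = 2 ** 40  # one tebibyte in bytes


def to_binary_units(capacity_bytes: int, unit_size: int) -> float:
    """Express a byte count as a multiple of a binary unit."""
    return capacity_bytes / unit_size


labeled_750_gb = 750 * 10 ** 9  # a drive sold as "750 GB" (base 10)
labeled_1_tb = 10 ** 12         # a drive sold as "1 TB" (base 10)

in_gib = to_binary_units(labeled_750_gb, GIB)
in_tib = to_binary_units(labeled_1_tb, TIB)

print(f"750 GB is about {in_gib:.1f} GiB")
print(f"1 TB is about {in_tib:.3f} TiB")
```

Under these assumptions, a drive labeled 750 GB holds a little under 700 GiB, and a drive labeled 1 TB holds roughly 0.91 TiB.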

Table of prefixes

| Metric (abbr.) | Value | Binary (abbr.) | Value | Difference* | Difference in bytes |
|---|---|---|---|---|---|
| byte (B) | 10^0 = 1000^0 | byte (B) | 2^0 = 1024^0 | 0% | 0 |
| kilobyte (KB) | 10^3 = 1000^1 | kibibyte (KiB) | 2^10 = 1024^1 | 2.4% | 24 |
| megabyte (MB) | 10^6 = 1000^2 | mebibyte (MiB) | 2^20 = 1024^2 | 4.9% | 48,576 |
| gigabyte (GB) | 10^9 = 1000^3 | gibibyte (GiB) | 2^30 = 1024^3 | 7.4% | 73,741,824 |
| terabyte (TB) | 10^12 = 1000^4 | tebibyte (TiB) | 2^40 = 1024^4 | 10% | 99,511,627,776 |
| petabyte (PB) | 10^15 = 1000^5 | pebibyte (PiB) | 2^50 = 1024^5 | 12.6% | 125,899,906,842,624 |
| exabyte (EB) | 10^18 = 1000^6 | exbibyte (EiB) | 2^60 = 1024^6 | 15.3% | 152,921,504,606,846,976 |
| zettabyte (ZB) | 10^21 = 1000^7 | zebibyte (ZiB) | 2^70 = 1024^7 | 18.1% | 180,591,620,717,411,303,424 |
| yottabyte (YB) | 10^24 = 1000^8 | yobibyte (YiB) | 2^80 = 1024^8 | 20.9% | 208,925,819,614,629,174,706,176 |

*Increase, rounded to the nearest tenth
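The difference columns can be recomputed with a few lines of Python, shown here as a sketch. The function name `prefix_gap` is an assumption for this example; it takes the power of 1000 (or 1024) that a prefix pair stands for.

```python
def prefix_gap(power: int) -> tuple[int, float]:
    """Return the byte difference and percent increase of 1024**power over 1000**power."""
    decimal = 1000 ** power
    binary = 1024 ** power
    difference = binary - decimal
    percent = round(100 * difference / decimal, 1)  # nearest tenth of a percent
    return difference, percent


# Powers 0 through 8 correspond to byte through yottabyte/yobibyte.
for power in range(9):
    difference, percent = prefix_gap(power)
    print(f"1000^{power} vs 1024^{power}: {percent}% ({difference:,} bytes)")
```

Python integers have arbitrary precision, so the byte differences are exact even at the yottabyte scale.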

References

  1. What is big-endian? - A Word Definition From the Webopedia Computer Dictionary (Accessed April 15th, 2007).
  2. Dave Wilton (2006-04-08). Wordorigins.org; bit/byte.
  3. Bob Bemer (Accessed April 12th, 2007). Origins of the Term "BYTE".
  4. IEEE Trial-Use Standard for Prefixes for Binary Multiples (Accessed April 14th, 2007).
  5. Nate Mook (2006-06-28). Western Digital Settles Capacity Suit.