HTTP: Difference between revisions

From Citizendium
imported>Pat Palmer
No edit summary
imported>Pat Palmer
mNo edit summary
 
(38 intermediate revisions by 3 users not shown)
Latest revision as of 07:11, 24 June 2011

This article is developing and not approved.

HTTP (Hypertext Transfer Protocol) is a client-server protocol used for file exchange on the World Wide Web. HTTP is the means of communication between a web browser (a program which acts as an HTTP client) and a web server (a program which acts as an HTTP server). Both the HTTP request and the HTTP response begin with protocol overhead (a start line and header fields) written as ASCII text. Besides this overhead, the request optionally may contain one or more named parameters elaborating on the request, and the response optionally may contain a "payload", or body. The body is a sequence of bytes and is not restricted to ASCII: binary information such as images or audio can be sent as-is, with a header field such as Content-Type telling the client how to interpret the bytes. A server may also apply a content encoding such as gzip compression, which the client must then reverse upon reception. Web browsers are not the only type of HTTP client program. HTTP is also used by search engines to index the World Wide Web, as well as by so-called spam-bots which scrape web pages to obtain information for malicious purposes.

The HTTP specifications

The two HTTP specifications in active use today contain precise, elegant language which is accessible to lay readers if they carefully consider the definitions provided near the top of the documents:

  • RFC 1945, Hypertext Transfer Protocol -- HTTP/1.0[1], which specifies how the browser and server communicate with each other
  • RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1[2], which refines the handling of caching and proxies and adds persistent (keep-alive) connections to the original specification

HTTP and the Transmission Control Protocol

HTTP requests and responses rely upon the Internet's Transmission Control Protocol (TCP) for error-free transmission between client programs and server programs. HTTP is one of several well-known TCP applications and is assigned the well-known TCP port number 80. Thus, browsing to a URL such as http://diatom.ansp.org/ is actually equivalent to browsing to http://diatom.ansp.org:80/. If two web server programs were to execute simultaneously on a single host computer, it would be necessary for one of them to be configured to use a different TCP port number, conventionally one outside the well-known range (well-known TCP ports are those from 0 to 1023), such as 8080. A URL such as http://diatom.ansp.org:8080/ might be used, for example, to address an HTTP request to a second web server on a computer whose DNS name is diatom.ansp.org, assuming the second HTTP server program were configured to listen on TCP port 8080. If two web servers were to run simultaneously on one computer and both attempted to listen for requests on TCP port 80 (the HTTP default), they would conflict: typically the second one to start would fail to bind the port.
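
The default-port rule described above can be sketched in a few lines of Python. This is only an illustration, not part of any HTTP specification: the helper name effective_port and the DEFAULT_PORTS table are assumptions made for this example, while urlsplit comes from the Python standard library.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table for this sketch; only the "http" scheme is covered.
DEFAULT_PORTS = {"http": 80}

def effective_port(url: str) -> int:
    """Return the TCP port a client would connect to for the given URL."""
    parts = urlsplit(url)
    if parts.port is not None:          # an explicit ":8080" style port wins
        return parts.port
    return DEFAULT_PORTS[parts.scheme]  # otherwise fall back to the scheme default

print(effective_port("http://diatom.ansp.org/"))       # 80
print(effective_port("http://diatom.ansp.org:80/"))    # 80
print(effective_port("http://diatom.ansp.org:8080/"))  # 8080
```

The first two URLs resolve to the same port, which is why they address the same server program.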

HTTP and URLs

HTTP requests and responses consist of several header lines and, optionally, a body. The location of the server program and the exact web page or file the client is requesting are both precisely identified in a Uniform Resource Identifier (URI)[3].
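
As a rough sketch of how a client might turn a URL into the first lines of a request, the Python below splits the URL into the server location (used for the Host header) and the resource path (used in the request line). The function name build_get_request is hypothetical, and a real browser would send many more header lines.

```python
from urllib.parse import urlsplit

def build_get_request(url: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request for the given http:// URL."""
    parts = urlsplit(url)
    # The path (plus any query string) names the resource on the server.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # The host (plus any non-default port) names the server program.
    host = parts.hostname
    if parts.port not in (None, 80):
        host += f":{parts.port}"
    # Header lines end with CRLF; an empty line marks the end of the header.
    lines = [f"GET {target} HTTP/1.1", f"Host: {host}", "", ""]
    return "\r\n".join(lines).encode("ascii")

print(build_get_request("http://www.lth.se/").decode("ascii"))
```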

Example conversation between an HTTP server and client

The following is a typical HTTP message exchange of the kind which is carried out every time a page is loaded in a web browser. In the following example, the user has entered the URL http://www.lth.se/ in a browser and submitted it. The browser then sends an HTTP request, using TCP, to the computer whose DNS name is www.lth.se. The www.lth.se computer must have an HTTP server program executing and configured to listen on TCP port 80. If this program is present, the TCP software on www.lth.se passes the HTTP request to the HTTP server program. The HTTP server program will then send back, via TCP, an HTTP response containing the web server's default page. The comments in the right-hand column are not part of the actual conversation.

Request:

GET / HTTP/1.1                          Please transmit your root page, using HTTP 1.1,
Host: www.lth.se                        located at www.lth.se.
User-Agent: Mozilla/5.0 (Linux i686)    I am Mozilla 5.0, running on i686 Linux.
Accept: text/html                       I understand HTML-coded documents.
Accept-Language: sv, en-gb              I prefer pages in Swedish and British English.
Accept-Encoding: gzip, deflate          You may compress your content as gzip or deflate, if you wish.
Accept-Charset: utf-8, ISO-8859-1       I understand text encoded in Unicode and Latin-1.
                                        (Empty line signifies end of request)

Response:

HTTP/1.1 200 OK                         Request is valid; I am complying according to HTTP 1.1 (code 200).
Date: Wed, 26 May 2010 10:33:59 GMT     The time (in GMT) is 10:33 [...].
Server: Apache/2.2.3 (Red Hat)          I am Apache 2.2.3, running on Red Hat Linux.
Content-Length: 54283                   The content you requested is 54,283 bytes long.
Content-Type: text/html; charset=utf-8  Prepare to receive text encoded in Unicode, to be interpreted as HTML.
                                        (Empty line signifies end of headers)
<html>                                  (The webpage www.lth.se/ follows)
  <head>
    <title>Some page title</title>
...
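
To make the structure of the response concrete, here is a minimal Python sketch that splits a raw response into its status line, header fields and body. It assumes the whole response is already in memory and that no header name is repeated; the function name parse_response and the sample bytes are made up for this illustration.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into (version, code, reason, headers, body)."""
    # The header block ends at the first empty line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("ascii").split("\r\n")
    # Status line: protocol version, numeric status code, reason phrase.
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return version, int(code), reason, headers, body

sample = (b"HTTP/1.1 200 OK\r\n"
          b"Content-Type: text/html; charset=utf-8\r\n"
          b"Content-Length: 13\r\n"
          b"\r\n"
          b"<html></html>")
version, code, reason, headers, body = parse_response(sample)
print(code, reason, headers["content-type"], len(body))
```

The body is kept as bytes because, unlike the header, it is not required to be text.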

These are a few additional example responses that could also occur:


Response:

HTTP/1.1 400 Bad Request                Request does not conform to HTTP/1.1, and I did not understand it.
                                        Please do not repeat the request in its current form.

Response:

HTTP/1.1 404 Not Found                  Request conforms with HTTP/1.1, but the resource requested was not found here.
                                        Please do not repeat the request for this resource.

Response:

HTTP/1.1 403 Forbidden                  Request is conformant and valid, but I refuse to comply under HTTP/1.1.
                                        Please do not repeat the request.

HTTP request methods

HTTP clients can use one of eight request methods:

  • HEAD
  • GET
  • POST
  • PUT
  • DELETE
  • TRACE
  • OPTIONS
  • CONNECT

In practice, it is mainly the GET, POST and HEAD methods that are used in web applications, although protocols like WebDAV make use of others.
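
The method is the first token of the request line. As a small sketch, a server could check it against the eight names listed above; method names are case-sensitive, so "get" is not accepted. The function name parse_request_line is hypothetical.

```python
# The eight request methods listed above.
METHODS = {"HEAD", "GET", "POST", "PUT", "DELETE", "TRACE", "OPTIONS", "CONNECT"}

def parse_request_line(line: str):
    """Split a request line into (method, target, version), rejecting unknown methods."""
    method, target, version = line.split(" ")  # raises ValueError if not exactly 3 parts
    if method not in METHODS:                  # comparison is case-sensitive
        raise ValueError(f"unknown method: {method}")
    return method, target, version

print(parse_request_line("GET / HTTP/1.1"))
```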

Status codes

An HTTP response (sent by a server in reply to an HTTP request) begins with a status line, which informs the client whether the request succeeded. The status line is made up of the protocol version, a "status code" and a "reason phrase" (descriptive text).

Status code classes

Status codes are grouped into classes:

  • 1xx (informational) : Request received, continuing process
  • 2xx (success) : The action was successfully received, understood, and accepted
  • 3xx (redirect) : Further action must be taken in order to complete the request
  • 4xx (client error) : The request contains bad syntax or cannot be fulfilled
  • 5xx (server error) : The server failed to fulfill an apparently valid request.

For example, if the client requests a non-existent document, the status code will be "404 Not Found".

According to RFC 2616:

HTTP applications are not required to understand the meaning of all registered status codes, though such understanding is obviously desirable. However, applications MUST understand the class of any status code, as indicated by the first digit, and treat any unrecognized response as being equivalent to the x00 status code of that class, with the exception that an unrecognized response MUST NOT be cached.
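
The fallback rule quoted above can be sketched as follows. The set of "known" codes is deliberately tiny (only the codes used as examples in this article) and, like the function name interpret_status, is an assumption of this sketch rather than anything a real client is required to use.

```python
# Class names as listed above, keyed by the first digit of the status code.
CLASSES = {1: "informational", 2: "success", 3: "redirect",
           4: "client error", 5: "server error"}

# Hypothetical set of codes this client understands (those shown in this article).
KNOWN_CODES = {200, 400, 403, 404}

def interpret_status(code: int):
    """Return (effective_code, class_name, cacheable_as_recognized)."""
    first_digit = code // 100
    if first_digit not in CLASSES:
        raise ValueError(f"not an HTTP status code: {code}")
    if code in KNOWN_CODES:
        return code, CLASSES[first_digit], True
    # Unrecognized code: treat it as the x00 code of its class, and do not cache it.
    return first_digit * 100, CLASSES[first_digit], False

print(interpret_status(404))  # (404, 'client error', True)
print(interpret_status(499))  # (400, 'client error', False)
```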

All HTTP/1.1 status codes

All the codes are described in RFC 2616.

HTTP header and cache management

The HTTP message header includes a number of fields used to facilitate cache management. One of these, ETag (entity tag), is a string-valued field whose value changes when the page (or other resource) is modified: a strong entity tag must change whenever the resource changes at all, while a weak entity tag need change only when the modification is semantically significant. This allows browsers or other clients to determine whether or not the entire resource needs to be downloaded. The HEAD method, which returns the same message header that would be included in the response to a GET request, can be used to determine whether a cached copy of the resource is up to date without actually downloading a new copy. Other header fields can indicate when a copy should expire (no longer be considered valid), or that it should not be cached at all. The latter is useful when data is generated dynamically (for example, the number of visits to a web site).
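
One way a server might use entity tags is sketched below. Deriving the tag from a hash of the content is only one possible choice, assumed here for illustration; servers are free to generate tags however they like. The If-None-Match request header and the 304 Not Modified status code are defined in HTTP/1.1 but are not otherwise discussed in this article, and the function names make_etag and respond are hypothetical.

```python
import hashlib
from typing import Optional

def make_etag(content: bytes) -> str:
    """Derive a strong entity tag from the content (assumption: a truncated SHA-256 hash)."""
    return '"' + hashlib.sha256(content).hexdigest()[:16] + '"'

def respond(content: bytes, if_none_match: Optional[str]):
    """Return (status_code, headers, body) for a GET, honouring the client's cached tag."""
    etag = make_etag(content)
    if if_none_match == etag:
        # The client's cached copy is still current: send no body.
        return 304, {"ETag": etag}, b""
    # No cached copy, or the resource has changed: send the full resource.
    return 200, {"ETag": etag, "Content-Length": str(len(content))}, content

status, headers, body = respond(b"<html>v1</html>", None)
cached_tag = headers["ETag"]
print(respond(b"<html>v1</html>", cached_tag)[0])  # unchanged resource
print(respond(b"<html>v2</html>", cached_tag)[0])  # modified resource
```

Because the tag changes whenever the bytes change, the client only downloads the resource again after a modification.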

HTTP server operations

Multiple virtual servers may map onto a single physical computer. To be used effectively, servers must be placed on networks engineered to handle their traffic; see port scanning for how an Internet Service Provider may check for servers placed where their traffic can create problems.

History

HTTP was designed at CERN by Tim Berners-Lee in 1989 as a way to share hypertext documents.[4] In the early 1990s, HTTP and HTML together began to be used by other sites, primarily in the scientific world. The availability of the Mosaic web browser (1993) and the NCSA HTTPd web server, both developed at the National Center for Supercomputing Applications (Mosaic by a team that included Marc Andreessen), was key to the explosion in popularity of HTTP and HTML that followed.

The first (1990) version of HTTP, called HTTP/0.9, was a bare-bones protocol for raw data transfer across the Internet. HTTP/1.0 (1996) improved the protocol by allowing messages to be in a MIME-like format, carrying metadata about the data transferred along with other directives on how to handle the request and response.

However, HTTP/1.0 was still limited: it did not adequately provide for web caches, persistent connections for repeated messaging, or virtual web servers. To meet those needs, HTTP/1.1 was developed.

References

  1. Request for Comments: 1945, Hypertext Transfer Protocol -- HTTP/1.0. IETF Network Working Group (May 1996). Retrieved on 2007-04-02.
  2. Request for Comments: 2616, Hypertext Transfer Protocol -- HTTP/1.1. IETF Network Working Group (June 1999). Retrieved on 2011-06-12.
  3. Request for Comments: 3986, Uniform Resource Identifier (URI): Generic Syntax. IETF Network Working Group (January 2005). Retrieved on 2007-04-02.
  4. Berners-Lee, Tim (March 1989), Tim Berners-Lee's proposal: "Information Management: a Proposal"