28th August, 2008

Email And The World Wide Web


 

HTTP - The Guts of WWW

Created: 13th June 2008

 

The Internet is a broad term that can mean many things. Within the Internet we have email, newsgroups, BitTorrent, online gaming and more, but this article focuses on the part you probably know best: the World Wide Web.

Although the 'World Wide Web (WWW)' sounds like yet another term for the Internet, it actually isn't. It describes one specific aspect of the Internet: the process of viewing content remotely using the HTTP protocol.

This article goes under the covers of the World Wide Web and into the protocol that brings us Facebook, Google and eBay - HTTP, the Hypertext Transfer Protocol. I warn you it's not a light read (easier than the RFC though!), but the subject has a diverse history, and the technology has constantly had to adapt to a new world of online users. If you're interested in how Facebook's web pages get to your computer, then this guide is for you.

Defining the Standard

The HTTP protocol is widely credited to one person - Tim Berners-Lee. Tim began work on HTTP around 1990, when the need for a much easier way to distribute information between computers became obvious. While working at CERN, he wrote the first ever web server and web browser, starting the revolution we now know as URLs, HTML and the WWW. Once the technology matured, Tim co-authored (with Roy Fielding and Henrik Frystyk) the first official RFC for his new protocol, which covers two versions of it - RFC 1945, Hypertext Transfer Protocol -- HTTP/1.0.

Note:

The first ever version of HTTP was actually HTTP/0.9 created around 1990 by Tim Berners-Lee. It was extremely basic and didn't serve much use outside of the CERN lab - but because it paved the way for HTTP/1.0, Berners-Lee included it in the first official RFC.

HTTP/1.0 defined the protocol. Simply put, other people could now create applications that spoke HTTP, and that's exactly what happened. Companies like Netscape popped up, Microsoft took an interest with their 'Internet Explorer' add-on to Windows 95, and before you knew it the world was talking HTTP. The RFC was published in 1996 - at least a year after the first commercial browser came out. This goes to show the protocol was a success long before its inventors could even document it, something that happens often throughout IT history (wireless for example!).

HTTP Settings in IE7
Internet Explorer 7 still lets you choose your HTTP version, though HTTP/1.1 is the default.

HTTP/1.0 enjoyed a good run from 1996 until 1999. Even after 1999 it continued to be used by many servers and browsers, and it still accounts for a large percentage of Internet traffic to this day. But in June 1999, HTTP/1.1 was defined by Berners-Lee and many other talented people from around the globe in RFC 2616 (which revised an earlier HTTP/1.1 specification, RFC 2068, from January 1997). HTTP/1.1 contained much needed improvements and enhancements over HTTP/1.0. In fact, much of it was already in use by the time it was documented, because many commercial products, struggling to keep up with the demand for new features, had added their own extensions, which became unofficially known as HTTP/1.0+.

Today we mostly use HTTP/1.1, with a little HTTP/1.0 for some devices. All modern browsers understand both versions, but try to use the latest if possible. HTTP/1.0 still lingers because of some proxy servers and gateways that exist on the Internet even today. However, the differences between the two rarely cause problems for your average modern Internet user.

 

HTTP In a Nutshell

Regardless of the protocol version and its intricacies, HTTP was designed to do one thing: retrieve remote documents. Although that's a gross oversimplification, it is the main purpose of HTTP today. We use it every time we view a web page, fill in an online form or search the Internet.

Web pages are simply files made up of objects. For example, this page contains text and pictures. The file itself is an object, and each image is an object. They are all stored on a server, just like data is stored on your PC. In order for you to view them, your computer has to download these objects from the server and display them on your screen. HTTP facilitates this process. Your browser constructs the necessary HTTP request message and sends it to the server containing this page (found using the URL you enter, or by following the information in a link or button). If everything goes well, the server sends an HTTP response, including the files (objects) you asked for.

Finally your browser saves them into the Internet cache on your computer, and displays them. Simple, huh?

The diagram below shows a typical HTTP connection between a client and server. I've also included some of the basic TCP settings, such as the port number, to highlight how HTTP relies on the transport layer underneath to handle the specifics of data transfer.
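
To make the request step a little more concrete, here is a minimal sketch of how a client might turn a URL into the text of a GET request. The function name build_get_request and the "ExampleBrowser/0.1" user agent are made up for illustration; a real browser sends many more headers than this.

```python
from urllib.parse import urlsplit

def build_get_request(url, user_agent="ExampleBrowser/0.1"):
    """Return the text of a minimal HTTP/1.1 GET request for `url`.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # The path (plus any query string) goes on the request line;
    # an empty path means the server's root document, "/".
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"GET {target} HTTP/1.1",
        f"Host: {parts.netloc}",        # the host part of the URL
        f"User-Agent: {user_agent}",
        "Accept: text/html",
        "Connection: Keep-Alive",
    ]
    # Each line ends with CRLF; a blank line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"

request = build_get_request("http://192.168.1.10/")
print(request)
```

The resulting text is what would be written to a TCP connection on port 80 of the server.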

A simple HTTP connection
A basic HTTP connection to port 80 of a web server.

Of course things will always get more complicated. Many modern web pages are made up of several objects, so HTTP has to fetch dozens of objects, each one requiring a request/response exchange, and each one needing to travel between the server and your computer. HTTP makes easy work of this thanks to being lightweight and using simple text based headers, even though it's capable of transferring binary data.

Before moving on, it's worth noting the main differences between HTTP/1.0 and HTTP/1.1. The main jump was the introduction of persistent connections. HTTP/1.0 had to create a new connection for every single object on a web page. This wasn't so much of an issue when a web page consisted of mostly text and the odd image, but you couldn't realistically download Amazon's homepage in any kind of decent time frame using HTTP/1.0. HTTP/1.1 allows clients and servers to reuse the same TCP connection to transfer multiple objects. Connections stay open by default, and the Connection header is used to signal when one should be closed.

Not only did persistent connections come into effect, but HTTP/1.1 also introduced improved cache handling, compression of data during transfer, error handling and proxy authentication, all features that we rely on today. HTTP/1.1 is the most widely used version of HTTP even today.
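
As a rough illustration of why one page means many requests, the sketch below scans a small, made-up HTML page for embedded objects using Python's standard html.parser module. The ObjectLister class and the sample page are assumptions for this example, and the tag list is deliberately simplified (it treats every img, script and link tag as a separate object).

```python
from html.parser import HTMLParser

class ObjectLister(HTMLParser):
    """Collect the URLs of embedded objects referenced by a page."""

    def __init__(self):
        super().__init__()
        self.objects = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Simplification: images and scripts use "src", links use "href".
        if tag in ("img", "script") and attrs.get("src"):
            self.objects.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.objects.append(attrs["href"])

# A made-up page: one stylesheet and two images.
page = """<html><head><title>this is a test</title>
<link rel="stylesheet" href="style.css">
</head><body><img src="logo.png"><img src="photo.jpg"></body></html>"""

lister = ObjectLister()
lister.feed(page)
# One request for the page itself, plus one per embedded object.
total_requests = 1 + len(lister.objects)
print(lister.objects, total_requests)
```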
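
The decision about whether a connection stays open can be sketched in a few lines. This assumes the usual rules: an HTTP/1.1 connection is persistent unless either side sends "Connection: close", while an HTTP/1.0 connection closes unless the unofficial "Connection: Keep-Alive" extension is used. The function name keeps_connection_open is hypothetical.

```python
def keeps_connection_open(version, headers):
    """Return True if the connection should stay open after this message.

    `version` is a string such as "HTTP/1.0" or "HTTP/1.1";
    `headers` maps header names to values.
    """
    # Header names are case-insensitive, so normalise before comparing.
    connection = ""
    for name, value in headers.items():
        if name.lower() == "connection":
            connection = value.lower()
    tokens = [t.strip() for t in connection.split(",")]
    if "close" in tokens:
        return False
    if version == "HTTP/1.1":
        return True                  # persistent by default
    return "keep-alive" in tokens    # HTTP/1.0 has to opt in

print(keeps_connection_open("HTTP/1.1", {}))
print(keeps_connection_open("HTTP/1.0", {}))
```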

How did HTML get involved with HTTP?

Tim Berners-Lee created the very first version of HTML by basing it on the then popular SGML (Standard Generalized Markup Language). It was used to tell clients how to display the data, while HTTP simply transferred it. HTML quickly became an open standard, with many scientists and academics around the world contributing to its success. In fact, you can even find the early discussions on the net today. An example can be found here, where the now valuable <img> tag was born.

HTTP is very much a web standard now, and is still being developed. You can find the latest discussions (as of 2008) here.

Detailed Look at HTTP

Probably the best way to see how HTTP works is to see it in action. In this section we'll take a look at a very basic, yet valid, HTTP conversation between a web browser running on Windows (Opera, going by the User-Agent header in the capture below) and Microsoft's IIS server. We'll transfer a single HTML page, only containing text, and then increase the complexity to show exactly how HTTP handles this content.

To start with, let's perform a simple connection between the client and server; it's as simple as going to the server's root URL. Below is a simple diagram, along with the HTTP headers from the conversation:

A detailed HTTP connection
Connection showing the object being requested, and the server returning a response code of 200 along with the data.

Client request:

GET / HTTP/1.1
User-Agent: Opera/9.26 (Windows NT 5.1; U; en)
Host: 192.168.1.10
Accept: text/html
Cache-Control: no-cache
Connection: Keep-Alive

Server Response:

HTTP/1.1 200 OK
Content-Length: 998
Content-Type: text/html
Server: Microsoft-IIS/6.0
Date: Sun, 08 Jun 2008 20:01:10 GMT

<html>
<head><title>this is a test</title>
(...)

In the listings above, each header name appears to the left of the colon and its value to the right. We start with the client request. The first line is the actual request, asking the server to 'get' the root document (/) using the HTTP/1.1 protocol. The request is sent to the server named in the 'Host' header (couple the Host header to the relative path in the request and you have a URL). The other headers vary from system to system, though the ones you see here are the most common. Here's a brief description of HTTP/1.1's most common headers:

Accept (Request) - Informs the server which media types the client can handle
User-Agent (Request) - Contains information about the requesting client, for informational or tracking purposes only
Host (Request) - The host machine for which the request is intended
Content-Length (Request and Response) - The length of the message body in bytes, which tells the recipient how much data to expect
Content-Type (Response) - The type of data being returned as the result of a request
Last-Modified (Response) - The date and time the server believes the file was last modified
Server (Response) - The server name and possibly version
Date (Response) - The date and time at which the response was sent
If-Modified-Since (Request) - If the object requested hasn't been modified since this date, the server will return a 304 code and that's it. However, if the object is newer than the date in the request, the server processes it as usual
Connection (Request and Response) - Specifies the type of connection between two endpoints, used to signal whether a connection should be kept open or closed

A full list of HTTP headers and their explanations can be found in the HTTP/1.1 RFC located here.
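
To show how these headers are laid out on the wire, here is a small sketch that splits a raw response into its status line, headers and body. The parse_response function is a hypothetical helper, and the sample response is a shortened stand-in for the IIS reply shown earlier (with a Content-Length that matches its tiny body). Real parsers handle many more cases, such as chunked transfers.

```python
def parse_response(raw):
    """Split a raw HTTP response (bytes) into status, headers and body."""
    # A blank line (CRLF CRLF) separates the headers from the body.
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lower-cased.
        headers[name.strip().lower()] = value.strip()
    # Content-Length says how many bytes of body follow the headers.
    length = int(headers.get("content-length", len(rest)))
    return version, int(code), reason, headers, rest[:length]

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Length: 13\r\n"
       b"Content-Type: text/html\r\n"
       b"Server: Microsoft-IIS/6.0\r\n"
       b"\r\n"
       b"<html></html>")
version, code, reason, headers, body = parse_response(raw)
print(code, reason, headers["content-type"], body)
```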

There are many more headers which HTTP can use during normal requests and responses (authorization, caching and data formatting, to name a few); however, these are the most common that you would expect to see on modern web servers and browsers.

With every request that HTTP sends, a response code is returned (assuming the server received the request!). HTTP response codes are grouped into 5 sets of numbers:

1xx - Informational codes - provisional responses used to send the client a single-line informational message
2xx - Success codes - used to signal a successful request that the server can process. Common 2xx messages are 200 OK and 202 Accepted
3xx - Redirection codes - used to inform the client that further action is required, such as following a temporary URL. The response generally includes the new location of the file. Examples are 301 Moved Permanently and 304 Not Modified.
4xx - Client error codes - indicate something was wrong with the client request. The most common is 404 Not Found, meaning the client requested an object or file that doesn't exist. Others include 401 Unauthorized and 403 Forbidden.
5xx - Server error codes - indicate the server did something wrong. Common errors are 500 Internal Server Error (usually occurs when permissions or configurations are incorrect for the web server to access resources) and 503 Service Unavailable, meaning the server is temporarily unable to handle the request, for example because it is overloaded or down for maintenance.
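
Of the headers described above, If-Modified-Since is probably the easiest to show in code. The sketch below is one way a server might decide between a 200 and a 304 response; the function name conditional_status is hypothetical, and the date handling relies on Python's standard email.utils module rather than anything HTTP-specific.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_status(last_modified, request_headers):
    """Return 304 if the client's copy is still current, otherwise 200.

    `last_modified` is a timezone-aware datetime for the stored object.
    """
    since = request_headers.get("If-Modified-Since")
    if since is None:
        return 200                   # ordinary, unconditional request
    try:
        since_dt = parsedate_to_datetime(since)
    except (TypeError, ValueError):
        return 200                   # unparseable date: ignore the header
    if since_dt.tzinfo is None:
        since_dt = since_dt.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since_dt:
        return 304                   # not modified: no body is sent
    return 200

modified = datetime(2008, 6, 8, 20, 1, 10, tzinfo=timezone.utc)
stamp = format_datetime(modified, usegmt=True)   # HTTP-style date string
print(conditional_status(modified, {"If-Modified-Since": stamp}))
print(conditional_status(modified, {}))
```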

Modern browsers are configured to read these response codes and give friendly errors accordingly.
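
Because the first digit carries the meaning, grouping a response code takes very little code. This is only a sketch; the STATUS_CLASSES table and describe_status function are names invented for the example, not part of any browser.

```python
# Map the first digit of a status code to its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code):
    """Return the class name for an HTTP status code such as 404."""
    return STATUS_CLASSES.get(code // 100, "unknown")

for code in (200, 304, 404, 503):
    print(code, describe_status(code))
```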

HTTP/1.1 was also tweaked to include enhancements for intermediate devices. After all, today it's becoming less and less likely that you're actually talking to the web server directly as networks become more secure and efficient.

Proxy servers and caching engines have always suffered the hardships of interpreting data as it traverses the network. HTTP/1.1 allows content to be marked as cacheable or non-cacheable, and also supports more accurate document expiration, as we'll briefly look at in the next section.
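
As a small taste of how that marking works, the sketch below reads the HTTP/1.1 Cache-Control header and works out how long a cache may reuse a response without checking back with the server. The function name cache_lifetime is hypothetical, only the no-store, no-cache and max-age directives are handled, and the header is assumed to be stored under exactly the key "Cache-Control".

```python
def cache_lifetime(headers):
    """Return how many seconds a response may be reused from cache.

    0 means it must not be reused without asking the server again;
    None means the header gives no explicit lifetime.
    """
    value = headers.get("Cache-Control", "")
    directives = [d.strip().lower() for d in value.split(",") if d.strip()]
    if "no-store" in directives or "no-cache" in directives:
        return 0
    for directive in directives:
        if directive.startswith("max-age="):
            try:
                return max(0, int(directive[len("max-age="):]))
            except ValueError:
                return None          # malformed value: treat as unspecified
    return None

print(cache_lifetime({"Cache-Control": "public, max-age=3600"}))
print(cache_lifetime({"Cache-Control": "no-cache"}))
```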

The Serpent.co.uk © 2005 by John Payne. Site owned and maintained by John Payne. For emails to the webmaster, please use the feedback form.