A5 - HTTP Server

350 - 600 points

Run the server (your implementation)

python3 -m a5_http_server

How many websites do you visit per day? Do you need a specialized program for every one of them? Probably not. It has not always been this way. Before Tim Berners-Lee invented the HTTP protocol, the backbone of the web, people used different programs to access different parts of the web. What changed this was the unification of application-layer web protocols.

However, the complexities of dealing with HTTP are usually hidden behind the abstraction of web server frameworks like Node.js, Flask, Django, and others. Not today! In this assignment, you will explore what happens behind the curtains by implementing a simplified HTTP server.

Assignment (350 points)

Your job is to implement a simplified HTTP 1.1 server supporting the GET method. The website, based on the HTML files and images we provide, should be loadable from your server and viewable in any browser. Please note that while most modern browsers will render an HTML document even if it was sent over a protocol other than HTTP (for example, FTP), for this assignment you are required to use HTTP for all client-server communication.

We provide a minimal webpage that your implementation must be able to statically serve. Your server must support GET requests and be able to serve the different types of files typically found on a webpage, such as JPEG images, HTML, and CSS files. Additionally, your server must support different status codes, such as 200, 201, 400, and 404, and serve the correct HTML file depending on the response status code.

Specification

The main document you will be working with is RFC 2616, the specification for the HTTP/1.1 protocol. You will not have to implement the entire HTTP/1.1 specification; the sections of interest are:

Section 1 - Introduction: A general overview of the HTTP/1.1 protocol and its purpose.
Section 5.1 & 5.2 - Request Line: Learn how to parse requests sent by browsers, the message format, and its standard contents.
Section 9.3 - The GET method
Section 3.1 - HTTP Version
Section 7.2.1 - Type: Specify the content type served, i.e., jpeg, html, and so on.
Section 7.2.2 - Entity Length: Specify the content length, required for determining the end of the transmitted data.
Sections 10.2.1 - 10.4.5 - Status Codes: Descriptions of the status codes you will have to support. Note that we only require a subset of the presented status codes.
Section 8 - Connections: Support for persistent connections (ask yourself: why were those important to add?). Without persistent connection support, your browser will likely not load your HTTP 1.1 page.
Section 8.2.3 & 8.2.4 - Use of the 100 (Continue) Status & Client Behavior if Server Prematurely Closes Connection: Learn how you should deal with connections based on their state.
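As a starting point for Section 5.1, here is a minimal sketch of request-line parsing. The function name parse_request_line and the choice to raise ValueError (which the caller could map to a 400 response) are illustrative assumptions, not part of the provided skeleton.

```python
def parse_request_line(line: bytes) -> tuple[str, str, str]:
    """Split b'GET /path HTTP/1.1\\r\\n' into (method, target, version).

    Raises ValueError on malformed input; the caller decides how to
    respond (for example with a 400 status code).
    """
    # ISO-8859-1 maps every byte to a character, so decoding cannot fail.
    text = line.decode("iso-8859-1").rstrip("\r\n")
    parts = text.split(" ")
    if len(parts) != 3:
        raise ValueError(f"malformed request line: {text!r}")
    method, target, version = parts
    if not version.startswith("HTTP/"):
        raise ValueError(f"unexpected protocol version: {version!r}")
    return method, target, version
```

For example, parse_request_line(b"GET /index.html HTTP/1.1\r\n") returns the tuple ("GET", "/index.html", "HTTP/1.1").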

RFCs may seem overly complex, holding tremendous amounts of information that one cannot process in one sitting. The key to using RFCs to their full potential is treating them as a tool rather than a book to read. Use RFCs to guide your development process rather than as prerequisite reading before you start writing code.

That being said, we recommend reading the Introduction section at least once to familiarize yourself with the protocol.

Requirements

Your assignment must implement a subset of the features described in RFC 2616. In short, your server should implement functionality that enables statically serving any webpage. We define the following requirements as a smaller specification based on the RFC to guide your development process.

The following requirements indicate the protocol specification the server should adhere to, as well as technical requirements regarding implementation.

The server must support HTTP 1.1.
The server must support the GET method.
The server must support HTTP 1.1 persistent connections.
The implementation must not use either of the exit() or sendall() Python functions.
The server must support multiple concurrent connections.
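Since sendall() is not allowed, one possible approach is to loop over socket.send(), which may transmit only part of the buffer on each call. The helper name send_fully below is hypothetical, and the sketch assumes a blocking socket.

```python
import socket


def send_fully(sock: socket.socket, data: bytes) -> None:
    """Send every byte of data using repeated socket.send() calls."""
    view = memoryview(data)  # slicing a memoryview avoids copying the buffer
    while len(view) > 0:
        sent = sock.send(view)  # returns how many bytes were actually sent
        if sent == 0:
            raise ConnectionError("socket connection broken")
        view = view[sent:]
```

With non-blocking sockets, send() can raise BlockingIOError when the send buffer is full, so this loop would need to wait for writability before retrying.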

The following requirements describe the subset of features that should be implemented from RFC 2616.

The server has to support the 200, 400, and 404 status codes.
The server must indicate the appropriate content type in the response header.
The server must indicate the appropriate response length in the header.
The server must indicate the appropriate protocol type (HTTP 1.1) in the header.
The server must use UTF-8 encoding and indicate it in every response header.
If a page is not found, the server must serve the file 404.html.
When requested with /, the server must respond with the index.html page.
The server must be able to load any file from the filesystem in response to a GET request, including images, PDF documents, etc.
The server must support valid content types for the following file extensions: .pdf, .jpeg, .html, .js, .css, .png.
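The header requirements above could be combined roughly as follows. This is a sketch under assumptions: the names CONTENT_TYPES, REASONS, content_type_for, and build_response are illustrative, the fallback type for unknown extensions is our choice, and appending charset=utf-8 to every Content-Type is one reading of the UTF-8 requirement that you should check against the tests.

```python
import os

# MIME types for the file extensions listed in the requirements.
CONTENT_TYPES = {
    ".html": "text/html",
    ".css": "text/css",
    ".js": "text/javascript",
    ".png": "image/png",
    ".jpeg": "image/jpeg",
    ".pdf": "application/pdf",
}

# Reason phrases for the status codes used in this assignment.
REASONS = {200: "OK", 201: "Created", 400: "Bad Request", 404: "Not Found"}


def content_type_for(path: str) -> str:
    """Pick a MIME type from the file extension (case-insensitive)."""
    ext = os.path.splitext(path)[1].lower()
    # Assumption: unknown extensions fall back to a generic binary type.
    return CONTENT_TYPES.get(ext, "application/octet-stream")


def build_response(status: int, body: bytes, content_type: str) -> bytes:
    """Assemble the status line, headers, blank line, and body."""
    head = (
        f"HTTP/1.1 {status} {REASONS[status]}\r\n"
        f"Content-Type: {content_type}; charset=utf-8\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("utf-8") + body
```

Content-Length counts bytes, not characters, which is why the body is passed in already encoded.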

Bonus (250 points)

As HTTP is rapidly being adopted into more and more application-layer software, we start noticing the need for performant HTTP implementations. Namely, we want to be able to handle hundreds of thousands of concurrent connections and serve all of them with high throughput and low latency. Many web servers are currently competing for the number one spot in performance, but can we do the same?

Your task is to adapt your HTTP/1.1 server to serve a dynamic endpoint that gets updated through a POST request form. In doing so, consider performance, specifically handling many concurrent connections efficiently. Practice shows that handling each connection in a new thread will not get us far.

Part 1: Dynamic endpoint & `POST` requests

The personal_cats.html file contains a <table> entry, which is currently being statically served. You will have to update the file to include a special placeholder that you will then replace with the personal cat image links of each user. Each connection may submit different pictures to be included in the table, so each connection should see a different version of the personal_cats.html file upon requesting it.

Submitting the image links should be done through a POST /data request, using an application/x-www-form-urlencoded encoding with the following fields:

| Field Name | Description |
| --- | --- |
| cat_url | A URL (link) to a cat picture. |
| description | The description of the picture. |
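One way to decode and validate such a request body is with urllib.parse.parse_qs from the standard library. The function name parse_cat_form and the use of ValueError to signal a bad request are assumptions for illustration, as is treating whitespace-only values as empty.

```python
from urllib.parse import parse_qs

REQUIRED_FIELDS = ("cat_url", "description")


def parse_cat_form(body: bytes) -> dict[str, str]:
    """Decode a urlencoded body and require non-empty form fields."""
    # keep_blank_values=True keeps 'description=' as an empty string
    # instead of silently dropping the field.
    fields = parse_qs(body.decode("utf-8"), keep_blank_values=True)
    result = {}
    for name in REQUIRED_FIELDS:
        value = fields.get(name, [""])[0].strip()
        if not value:
            raise ValueError(f"missing or empty field: {name}")
        result[name] = value
    return result
```

parse_qs handles both percent-escapes and the "+" encoding of spaces, so "My+cat" comes back as "My cat".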

The table entry in the personal_cats.html file should be replaced as shown below: the first snippet is the table as currently provided, and the second is the table after your change. Your implementation should replace TABLE_PLACEHOLDER with the <tr><td>...</td></tr> entries containing the images stored by each user.

        <table style="text-align: center; width: 100%">
<!--            The contents of this table should be dynamically generated by the server.-->
<!--            The following are just examples of how the table should look like.-->
<!--            You will have to remove the contents of this table and dynamically generate the rows.-->
            <tr>
                <td>
                    <img src ="Here should be your source url" width="300"/>
                </td>
            </tr>
            <tr>
                <td>Cat name 1</td>
            </tr>
            <tr>
                <td>
                    <img src ="Here should be your source url" width="300"/>
                </td>
            </tr>
            <tr>
                <td>Cat name 2</td>
            </tr>
        </table>

 <table style="text-align: center; width: 100%">TABLE_PLACEHOLDER</table>

The state should not be persistently stored. It is ok if the images are gone if the user restarts their browser or the server is restarted.
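The placeholder substitution might look like the sketch below. The names render_rows and fill_template are hypothetical, and the row layout simply mirrors the example table above; escaping user input with html.escape is our addition so that submitted text cannot inject markup.

```python
from html import escape

PLACEHOLDER = "TABLE_PLACEHOLDER"


def render_rows(cats: list[tuple[str, str]]) -> str:
    """Build one image row and one caption row per (url, description)."""
    rows = []
    for url, description in cats:
        rows.append(
            f'<tr><td><img src="{escape(url, quote=True)}" width="300"/></td></tr>'
        )
        rows.append(f"<tr><td>{escape(description)}</td></tr>")
    return "".join(rows)


def fill_template(template: str, cats: list[tuple[str, str]]) -> str:
    """Replace every occurrence of the placeholder with the rendered rows."""
    return template.replace(PLACEHOLDER, render_rows(cats))
```

Because the state is per connection and not persistent, the list of cats could live in an in-memory dictionary keyed by connection, which disappears when the server restarts.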

Part 2: Handling concurrent connections efficiently

To handle many connections concurrently without losing performance, you will have to either implement a thread pool consumer algorithm or implement your HTTP server without using any threads at all, instead using the epoll interface. Epoll is a high-performance polling interface used in most top-tier web servers. You may choose which implementation to use; if you choose a thread pool, we expect it to be cleverly designed to reuse threads after a connection closes.

If you are on macOS, the epoll interface is not available to you, so you will have to implement a thread pool.
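To illustrate the thread pool option, here is a minimal sketch in which a fixed set of worker threads consumes items from a shared queue, so threads are reused rather than created per connection. The class name ThreadPool and its methods are hypothetical; in a server, the submitted items would be accepted client sockets and the handler would serve requests on them.

```python
import queue
import threading


class ThreadPool:
    """Fixed number of workers consuming work items from one queue."""

    def __init__(self, num_workers, handler):
        self._tasks = queue.Queue()
        self._handler = handler
        self._workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]
        for worker in self._workers:
            worker.start()

    def submit(self, item):
        """Queue an item (e.g. a client socket) for the next free worker."""
        self._tasks.put(item)

    def _run(self):
        while True:
            item = self._tasks.get()  # blocks until work is available
            try:
                if item is None:  # sentinel: stop this worker
                    return
                self._handler(item)
            except Exception:
                pass  # keep the worker alive; a real server would log this
            finally:
                self._tasks.task_done()

    def join(self):
        """Block until every queued item has been handled."""
        self._tasks.join()

    def shutdown(self):
        """Stop all workers after the remaining work is done."""
        for _ in self._workers:
            self._tasks.put(None)
        for worker in self._workers:
            worker.join()
```

A worker returns to the queue as soon as its handler finishes, which is what lets a small pool serve many short-lived connections.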

Your implementation will be judged by the grading teaching assistant. If the implementation is not considered to meet the performance requirements of the bonus assignment, the submission may be rejected. There are no automated tests for performance testing on CodeGrade.

Bonus specification

The following requirements describe the expected behavior of the dynamic endpoints for the first part of the bonus. Please note that the personal_cats.html file will be treated as an endpoint following these requirements.

The server must support POST requests on the /data endpoint.
The /data endpoint must properly implement application/x-www-form-urlencoded using the prescribed fields.
The server must support dynamic GET requests on the /personal_cats.html endpoint. Namely, each connection should be able to see their images but not others'.
The server must implement the 201 status code and serve the success.html page with it upon successful access of the /data endpoint.
The server must support the status code 400 and serve the 400.html page with it upon a failed access of the /data endpoint.
The /data endpoint must check if the fields are empty and display an appropriate error if erroneous fields are inputted.
The /personal_cats.html endpoint must support replacing the TABLE_PLACEHOLDER placeholder, regardless of the file or context it is found in.

The following requirements describe the expected behavior of the server regarding its performance in handling many concurrent connections.

The server should either implement a thread pool or use the epoll interface for handling multiple connections.
If using thread pools, the server should reuse threads that are no longer in use after a client disconnects.
If using epoll, the server should not spawn any new threads apart from the main thread.
The server should be able to handle 1000 concurrent curl requests gracefully.
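For the thread-free option, the structure of a readiness-based event loop can be sketched with Python's selectors module, which uses epoll on Linux; the same structure applies when calling select.epoll directly. Note the assumptions: this sketch echoes data back instead of speaking HTTP, the names serve_once and make_echo_handler are illustrative, and it assumes the socket's send buffer has room for the reply.

```python
import selectors
import socket


def make_echo_handler(sel: selectors.BaseSelector):
    """Return a callback that echoes data and cleans up on disconnect."""

    def on_readable(conn: socket.socket) -> None:
        data = conn.recv(4096)
        if data:
            conn.send(data)  # sketch: ignores partial sends
        else:
            # An empty read means the peer closed the connection.
            sel.unregister(conn)
            conn.close()

    return on_readable


def serve_once(sel: selectors.BaseSelector, timeout: float = 1.0) -> int:
    """Run one loop iteration; return the number of ready sockets handled."""
    events = sel.select(timeout)
    for key, _mask in events:
        callback = key.data  # the callback stored at registration time
        callback(key.fileobj)
    return len(events)
```

A server would register its listening socket with a callback that accepts new clients and registers each of them in turn, then call serve_once in a loop on the main thread.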

To qualify for the 50 bonus points for code quality, your implementation must be easily scalable to an indefinite number of endpoints, and it must demonstrate its performance capabilities with several hundred parallel curl requests (at least as performant as Python can get).

Testing performance

To test the performance of your HTTP server you want to create several thousand connections in parallel and time your server's response time. If your server cannot handle many connections in parallel the wait time will be longer. The time command will help us determine the execution time of all NUM_CONNECTIONS parallel requests. We will run the curl command in parallel by using the & control operator.

Execute the script below to test your server's performance. Adjust the NUM_CONNECTIONS variable according to your test's magnitude. For comparison, our sample implementation can handle 3000 connections in 6.37s user 12.62s system 839% cpu 2.263 total.

export NUM_CONNECTIONS=3000
time (
    for i in $(seq 1 $NUM_CONNECTIONS); do
        curl -s http://localhost:8000 >/dev/null &
    done
    # Block until every background curl has finished, so the timing covers all requests.
    wait
)

The test mentioned above is not integrated in the automated testing environment. To pass the performance criteria, you must demonstrate during the sign-off that your implementation can serve the required number of parallel connections.

