Faster HTTP Download


HTTP Download

Even though there are specific protocols for file transfer (FTP/SFTP), most content on the internet is served and transferred over HTTP. The data is sent back as an HTTP response, whose size has no defined limit in the RFC but is bounded by the integer Content-Length header field and sometimes by limits imposed by servers as well.

As the name suggests, the protocol was designed to send text data, and its use to transmit binary file content is somewhat of an overload in my opinion.

This article explores how to download files over HTTP more efficiently.

All code uses a 1 GB file hosted here for download-testing purposes and saves it to disk as 1GB.zip.

import requests

url = 'http://212.183.159.230/1GB.zip'
filename = '1GB.zip'

Simple Download

Doing a simple curl or wget for a large file served over HTTP opens a keep-alive connection that either streams the data or waits for the complete response to arrive at the application layer, keeps it in memory, and then hands it over to the disk. At the transport level (TCP) it is just one connection, limited by the window size in how much data can be in flight at a time; once a chunk is acknowledged the next chunk is sent.

A few lines of Python using requests can simulate what curl or wget does.

response = requests.get(url)

with open(filename, 'wb') as fp:
    fp.write(response.content)

This keeps all the data in memory and then writes it to disk.

Using Streams

Memory can become a limiting factor when handling large files in this fashion. Most HTTP libraries mitigate that by providing a stream handler: as soon as a chunk of a predefined size has been received, control is handed back to the application layer to deal with the data, and once that is done the next chunk is requested from the server. This reduces the memory requirement but significantly increases the number of disk-write system calls and the time taken to download the file.

cache = 10*1024*1024 # 10 MB

response = requests.get(url, stream=True)

with open(filename, 'wb') as fp:
    for chunk in response.iter_content(cache):
        fp.write(chunk)

HTTP Ranges

The aforementioned methods pose another issue: in case of failure there is no way to resume or do a partial download. For that, HTTP supports range requests, which can specify the start and end bytes to be fetched from the server. This provides two benefits: first, resuming a failed download, and second, opening multiple connections to overcome the per-connection TCP window limit and utilize the available bandwidth more efficiently. Being an optional feature it has to be supported by the server, and the client should maintain state to keep track of the bytes downloaded so far.
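As a minimal sketch of the resume case (the resume_download helper below is hypothetical, assuming the server honours range requests and a partial file from an earlier attempt is already on disk), the client can ask for everything from the current file size onwards:

import os
import requests

# hypothetical resume helper: append the missing tail of a partially downloaded file
def resume_download(url, filename, chunk=10*1024*1024):
    done = os.path.getsize(filename) if os.path.exists(filename) else 0
    headers = {"Range": "bytes={0}-".format(done)}  # open-ended range: from 'done' to EOF
    response = requests.get(url, headers=headers, stream=True)
    if response.status_code != 206:  # server ignored the Range header
        return False
    with open(filename, 'ab') as fp:  # append, keeping the bytes already on disk
        for part in response.iter_content(chunk):
            fp.write(part)
    return True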

Checking Server Support

There are two ways to check whether a server supports HTTP ranges. First, by sending a HEAD request to the URL and checking whether Accept-Ranges: bytes is present. Second, by requesting data directly with a Range header and seeing if the server responds with 206 Partial Content.

resp = requests.head(url)
supports = 'Accept-Ranges' in resp.headers and resp.headers['Accept-Ranges'] == 'bytes'

headers = {"Range": "bytes=0-100"}
resp = requests.get(url, headers=headers)
supports = resp.status_code == 206

Getting Content Size

To properly plan a range download over multiple connections it is essential to get the total size of the content the client is about to download. Every partial response carries this information in the Content-Range header.

Content-Range: bytes start-end/totalBytes

Polling to get the total number of bytes in the content.

headers = {"Range": "bytes=0-1"}
response = requests.get(url, headers=headers)
rangedata = response.headers.get('Content-Range')
total_bytes = int(rangedata.split('/')[1])
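If for some reason the ranged response does not include Content-Range, the Content-Length header of a plain HEAD request can serve as a fallback; a small sketch, assuming the server reports it:

resp = requests.head(url)
total_bytes = int(resp.headers['Content-Length'])  # assumes the server sends Content-Length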

Managing Download

Spawning concurrent worker processes to handle the partial downloads is the way to go. But make sure that the client has enough cores to efficiently handle multiple processes, and also check how many parallel connections the server allows from a single IP.
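The management loop below hands each segment to a downloadRange worker. Its body is not shown here, so what follows is a minimal sketch, assuming it simply streams the requested byte range into its own part file:

import requests

def downloadRange(url, start, end, partfilename, cache):
    # fetch only the bytes in [start, end] and stream them into a part file
    headers = {"Range": "bytes={0}-{1}".format(start, end)}
    response = requests.get(url, headers=headers, stream=True)
    with open(partfilename, 'wb') as fp:
        for chunk in response.iter_content(cache):
            fp.write(chunk)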

import os
from multiprocessing import Process

segment_size = 100*1024*1024 # 100 MB
start = 0
end = min(segment_size - 1, total_bytes - 1) # byte ranges are zero-indexed and inclusive

processes = []
part_files = []
segment = 1
cache = 10*1024*1024 # 10 MB stream cache
concurrent_conn = 4 # number of parallel connections
while True:
    partfilename = "part_{0}.zip".format(segment)
    part_files.append(partfilename)
    p = Process(target=downloadRange, args=(url, start, end, partfilename, cache))
    segment += 1
    processes.append(p)
    p.start()
    if len(processes) >= concurrent_conn:
        for p in processes:
            p.join()
        processes.clear()

    start = end + 1
    if start >= total_bytes:
        break
    end = min(start + segment_size - 1, total_bytes - 1)

if len(processes):
    for p in processes:
        p.join()

# stitch the part files together in order, then remove them
with open(filename, 'wb') as fp:
    for f in part_files:
        with open(f, 'rb') as fpart:
            fp.write(fpart.read())
        os.remove(f)
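A quick sanity check after stitching is to compare the size on disk against the total reported by the server; a simple sketch (a stronger check would verify a published checksum, if the host provides one):

import os

assert os.path.getsize(filename) == total_bytes, "downloaded size does not match reported total"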

Result

As mentioned above, all tests were done using the 1 GB file hosted here. Whenever streaming was enabled the cache size was 10 MB. For the range download the segment size was 100 MB and six parallel connections were used.

  • Direct : 312.41 seconds
  • Direct (Stream) : 347.71 seconds
  • Range (Stream) : 237.13 seconds