Extracting files from a remote zip archive
Here, I describe the process of figuring out how to extract files from a remote zip archive without having to download the entire remote zip file.
I recently needed a few files from a zip file that was located on a remote server. The zip file was about 250 GB, and downloading it to my local computer would take about 60 hours. Having seemingly no other alternative, I started the download and was resigned to having to use a lot of computer time and bandwidth over the next few days.
However, this also gave me some time to think about whether there might be a method that would allow me to extract the files that I need without needing to download the complete zip file. Here, I want to describe a process that allowed me to obtain such an outcome. I’ll describe both a messy and manual initial approach and a subsequent neater Python-based method.
A few disclaimers—mostly applicable to the ‘manual’ approach that I will describe first:
- Zip is an old and complicated file format, and aspects of the approach below are unlikely to be robust to the sorts of variation in zip files that one would encounter in the wild.
- I have only a superficial understanding of zip file structure.
- I have only used Linux.
- Some of the commands and code that I describe below cause changes to the local filesystem. Be careful if you use them in your own context—data loss may occur!
The problem
There is a zip archive in a file called data.zip that is about 250 GB in size (specifically, 253115123482 bytes). The file is stored on a remote server that I can access via sftp. I know that there is a file in the archive called image.png. My goal is to obtain a local copy of image.png without having to download the entire data.zip file.
Just download some of it?
My first thought was that if I could find out where in the archive the (compressed) file content was located, then maybe I could save some time by stopping the download once it had exceeded that number of bytes and then somehow extract the file from the incomplete zip file. The file locations are stored in the archive’s central directory. Unfortunately, the central directory is located at the end of the archive—if I have to download the whole thing to work out where the file is, then that goes against the whole point of the exercise (to avoid downloading the whole thing)!
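To make the layout concrete, here is a small stdlib-only sketch, using a tiny in-memory archive in place of the real data.zip, of how a reader locates the central directory: it scans backwards from the end of the file for the End Of Central Directory (EOCD) record, which in turn records where the central directory starts.

```python
import io
import struct
import zipfile

# Build a tiny in-memory archive to stand in for the real `data.zip`
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_handle:
    zip_handle.writestr("image.png", b"not really a png")
raw = buffer.getvalue()

# The EOCD record starts with the signature PK\x05\x06 and sits within
# the last ~65 kB of the archive, so readers search backwards for it
eocd_offset = raw.rfind(b"PK\x05\x06")

# Bytes 16-20 of the EOCD record hold the offset at which the central
# directory begins within the archive
(central_dir_offset,) = struct.unpack(
    "<I", raw[eocd_offset + 16 : eocd_offset + 20]
)

# The central directory itself starts with the signature PK\x01\x02
print(raw[central_dir_offset : central_dir_offset + 4])
```

(This sketch ignores zip64 archives and archive comments, which complicate the search, but it shows why the end of the file is the key region.)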
Maybe there is another way to obtain the central directory…
Obtaining the central directory
I was glad to see that sftp allows for incomplete downloads to be resumed, meaning that I could stop and resume the download as desired, without having to start all over again.
That led me to think that if I could mimic actually having downloaded all but the last section of the zip file, then maybe I could use this resumption capability to obtain the central directory without having to download much data at all.
Let’s allow 50 MB at the end of the zip file to be able to safely contain the central directory.
To use the resumption capability as a way of seek-ing, we need to generate a local file whose size is 50 MB less than the size of the zip file. We can use the fallocate command to quickly create a file of a given size. Here, we will call it partial_end.zip, and we want it to have a size of 253115123482 − 50000000 = 253065123482 bytes.
We can achieve this via:
$ fallocate -l 253065123482 partial_end.zip
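If fallocate is unavailable, the same placeholder can be created from Python with file.truncate, which on most Linux filesystems produces a sparse file that consumes essentially no disk space. A sketch, assuming the sizes from above:

```python
# Size of the remote archive, minus the 50 MB tail we still need to fetch
archive_size = 253_115_123_482
tail_size = 50_000_000

# `truncate` extends the file to the requested logical size; the unwritten
# region is a "hole", so no actual disk blocks are allocated
with open("partial_end.zip", "wb") as handle:
    handle.truncate(archive_size - tail_size)
```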
Now we can use the reget (“resume get”) command while in an sftp session on the server to download the last 50 MB of the zip file:
sftp> reget data.zip partial_end.zip
Now we have, in partial_end.zip, a local zip file containing the central directory. We will use the built-in Python zipfile module to access the central directory. In particular, we will use it to print out the location of our target file (image.png) in the archive and its compressed size.
import zipfile

target_file = "image.png"

with zipfile.ZipFile(file="partial_end.zip") as zip_handle:
    file_info = zip_handle.getinfo(name=target_file)

print(f"Archive location: {file_info.header_offset}")
print(f"Size in archive: {file_info.compress_size}")
Which gives us:
Archive location: 53679216658
Size in archive: 333825
Retrieving a single file
From the above, we have found out that our target file is located about 53 GB into the zip archive and is about 330 kB in size.
To retrieve this data, we can use a similar strategy as before, first generating an empty file (here called partial_target.zip) that has a size just smaller than the target file location:
$ fallocate -l 53679216658 partial_target.zip
Now we can again use the reget command while in an sftp session on the server to start downloading from the location of the target file in the archive:
sftp> reget data.zip partial_target.zip
We know that the file only occupies a few hundred kB in the zip archive, so we can manually stop the download after a short period.
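Rather than watching the transfer, we could compute an upper bound on how many bytes need to arrive before it is safe to stop. The compressed data at header_offset is preceded by a local file header of variable length, so the sketch below adds a generous slack (a small in-memory archive stands in for partial_end.zip here, so that the snippet is self-contained):

```python
import io
import zipfile

# A small in-memory archive stands in for `partial_end.zip`
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_handle:
    zip_handle.writestr("image.png", b"example payload")

# Generous allowance for the local file header that precedes the data
slack = 64 * 1024

with zipfile.ZipFile(buffer) as zip_handle:
    file_info = zip_handle.getinfo("image.png")
    stop_after = file_info.header_offset + file_info.compress_size + slack

# Stop the `reget` once the local file has grown to `stop_after` bytes
print(stop_after)
```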
To extract the file from the archive, we can build on the Python code that we wrote earlier.
The key aspect is that we point the zip file handler to our new partial_target.zip after the central directory has been initialised. That way, it knows the structure of the archive and has the relevant parts of the data needed to extract our target file.
import zipfile

target_file = "image.png"

with zipfile.ZipFile(file="partial_end.zip") as zip_handle:
    file_info = zip_handle.getinfo(name=target_file)
    with open("partial_target.zip", "rb") as partial_handle:
        # close the reference to `partial_end.zip`
        zip_handle.fp.close()
        # point instead towards the reference to `partial_target.zip`
        zip_handle.fp = partial_handle
        zip_handle.extract(member=file_info)
It seems a bit dubious to reassign the fp attribute in this way. However, the convention in Python is that attributes named without a leading underscore (unlike, say, _fp) are ‘public’, so maybe it is not so bad.
So we now have our target file, image.png, available on our local filesystem, having transferred only about 0.02% of the data that downloading the whole data.zip file would have required!
We can check the integrity of the extracted target file by comparing its CRC (Cyclic Redundancy Check) against the expected CRC stored in the zip file’s central directory:
import zipfile
import pathlib
import binascii

target_file = "image.png"

with zipfile.ZipFile(file="partial_end.zip") as zip_handle:
    file_info = zip_handle.getinfo(name=target_file)
    with open("partial_target.zip", "rb") as partial_handle:
        # close the reference to `partial_end.zip`
        zip_handle.fp.close()
        # point instead towards the reference to `partial_target.zip`
        zip_handle.fp = partial_handle
        zip_handle.extract(member=file_info)

local_CRC = binascii.crc32(pathlib.Path(target_file).read_bytes())

assert local_CRC == file_info.CRC
Abstracting away the details
The above process worked, but it required a lot of manual steps that are prone to error and would not scale gracefully to needing to access multiple files. Ideally, we can abstract away the details of how the process works and construct an interface that allows for user interaction at a more meaningful level.
In this case, it turns out that this also permits a much neater internal approach!
To perform such an abstraction, we will create a Python class, RemoteZipHandler, to represent the handling of the remote zip file. We will delegate the handling of the remote connection to the paramiko library and require that the user provide an instance of its SFTPClient.
The user will also need to provide the remote location of the relevant zip file.
It turns out that paramiko offers an interface with more functionality than the sftp console application. In particular, we can use its SFTPClient.open method to represent the remote zip file as a file-like object and pass this directly to the zipfile.ZipFile constructor, which can then randomly access locations in the remote zip file without any need to mess around with creating and resuming partial files! That is very cool and makes the process much more elegant.
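The fact that zipfile.ZipFile accepts any seekable file-like object, not just a filename, is what makes this possible; SFTPClient.open returns exactly such an object for the remote file. A quick stdlib-only demonstration, with an in-memory buffer standing in for the paramiko handle:

```python
import io
import zipfile

# An in-memory buffer stands in for the file-like object that
# `SFTPClient.open` would return for the remote archive
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_handle:
    zip_handle.writestr("image.png", b"example payload")

# `ZipFile` is happy to read from the file-like object directly
with zipfile.ZipFile(file=buffer) as zip_handle:
    names = zip_handle.namelist()
print(names)
```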
Here is my implementation of the RemoteZipHandler class. It defines the __enter__ and __exit__ “dunder” methods so that it can be used as a context manager and cleaned up appropriately, and it has methods that allow the user to obtain the names of all the files in the archive (get_file_list) and to extract one or more files (extract).
import zipfile


class RemoteZipHandler:

    def __init__(self, sftp_client, remote_zip_path):
        """
        `sftp_client` is an instance of `paramiko.sftp_client.SFTPClient`
        """
        self._remote_handle = sftp_client.open(filename=remote_zip_path)
        self._zip_handle = zipfile.ZipFile(file=self._remote_handle)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

    def close(self):
        self._zip_handle.close()
        self._remote_handle.close()

    def extract(self, paths):
        if isinstance(paths, str):
            paths = [paths]
        for path in paths:
            self._zip_handle.extract(member=path)

    def get_file_list(self):
        return [
            zip_info.filename
            for zip_info in self._zip_handle.filelist
        ]
Here is how it can be used to extract image.png from the data.zip file that we have been considering:
with RemoteZipHandler(
    sftp_client=sftp_client,
    remote_zip_path="data.zip",
) as remote_zip_handler:
    remote_zip_handler.extract("image.png")
Overall, this is a much simpler and more elegant approach. Importantly, the problem has been solved and the goal achieved—I can extract the relevant file from the archive without needing to download the entire zip file.