Extracting files from a remote zip archive
Here, I describe the process of figuring out how to extract files from a remote zip archive without having to download the entire remote zip file.
I recently needed a few files from a zip file that was located on a remote server. The zip file was about 250 GB, and downloading it to my local computer would take about 60 hours. Having seemingly no other alternative, I started the download and was resigned to having to use a lot of computer time and bandwidth over the next few days.
However, this also gave me some time to think about whether there might be a method that would allow me to extract the files that I need without needing to download the complete zip file. Here, I want to describe a process that allowed me to obtain such an outcome. I’ll describe both a messy and manual initial approach and a subsequent neater Python-based method.
A few disclaimers—mostly applicable to the ‘manual’ approach that I will describe first:
- Zip is an old and complicated file format, and aspects of the approach below are unlikely to be robust to the sorts of variation in zip files that one would encounter in the wild.
- I have only a superficial understanding of zip file structure.
- I have only used Linux.
- Some of the commands and code that I describe below cause changes to the local filesystem. Be careful if you use them in your own context—data loss may occur!
The problem
There is a zip archive in a file called data.zip that is about 250 GB in size (specifically, 253115123482 bytes). The file is stored on a remote server that I can access via sftp. I know that there is a file in the archive called image.png. My goal is to obtain a local copy of image.png without having to download the entire data.zip file.
Just download some of it?
My first thought was that if I could find out where in the archive the (compressed) file content was located, then maybe I could save some time by stopping the download once it had exceeded that number of bytes and then somehow extract the file from the incomplete zip file. The file locations are stored in the archive’s central directory. Unfortunately, the central directory is located at the end of the archive—if I have to download the whole thing to work out where the file is, then that goes against the whole point of the exercise (to avoid downloading the whole thing)!
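To make the layout concrete, here is a small stdlib-only sketch, using a tiny in-memory archive in place of the real data.zip, of how a reader locates the central directory: it scans backwards from the end of the file for the End Of Central Directory (EOCD) record, which in turn records where the central directory starts.

```python
import io
import struct
import zipfile

# Build a tiny in-memory archive to stand in for the real `data.zip`
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_handle:
    zip_handle.writestr("image.png", b"not really a png")
raw = buffer.getvalue()

# The EOCD record starts with the signature PK\x05\x06 and sits within
# the last ~65 kB of the archive, so readers search backwards for it
eocd_offset = raw.rfind(b"PK\x05\x06")

# Bytes 16-20 of the EOCD record hold the offset at which the central
# directory begins within the archive
(central_dir_offset,) = struct.unpack(
    "<I", raw[eocd_offset + 16 : eocd_offset + 20]
)

# The central directory itself starts with the signature PK\x01\x02
print(raw[central_dir_offset : central_dir_offset + 4])
```

(This sketch ignores zip64 archives and archive comments, which complicate the search, but it shows why the end of the file is the key region.)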
Maybe there is another way to obtain the central directory…
Obtaining the central directory
I was glad to see that sftp allows for incomplete downloads to be resumed, meaning that I could stop and resume the download as desired, without having to start all over again.
That led me to think that if I could mimic actually having downloaded all but the last section of the zip file, then maybe I could use this resumption capability to obtain the central directory without having to download much data at all.
Let’s allow 50 MB at the end of the zip file to be able to safely contain the central directory.
To use the resumption capability as a way of seek-ing, we need to generate a local file whose size is 50 MB less than the size of the zip file. We can use the fallocate command to quickly create a file of a given size. Here, we will call it partial_end.zip, and we want it to have a size of 253115123482 − 50000000 = 253065123482 bytes.
We can achieve this via:
$ fallocate -l 253065123482 partial_end.zip
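If fallocate is unavailable, the same placeholder can be created from Python with file.truncate, which on most Linux filesystems produces a sparse file that consumes essentially no disk space. A sketch, assuming the sizes from above:

```python
# Size of the remote archive, minus the 50 MB tail we still need to fetch
archive_size = 253_115_123_482
tail_size = 50_000_000

# `truncate` extends the file to the requested logical size; the unwritten
# region is a "hole", so no actual disk blocks are allocated
with open("partial_end.zip", "wb") as handle:
    handle.truncate(archive_size - tail_size)
```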
Now we can use the reget (“resume get”) command while in an sftp session on the server to download the last 50 MB of the zip file:
sftp> reget data.zip partial_end.zip
Now we have, in partial_end.zip, a local zip file containing the central directory. We will use the built-in Python zipfile module to access the central directory. In particular, we will use it to print out the location of our target file (image.png) in the archive and its compressed size.
import zipfile

target_file = "image.png"

with zipfile.ZipFile(file="partial_end.zip") as zip_handle:
    file_info = zip_handle.getinfo(name=target_file)

print(f"Archive location: {file_info.header_offset}")
print(f"Size in archive: {file_info.compress_size}")
Which gives us:
Archive location: 53679216658
Size in archive: 333825
Retrieving a single file
From the above, we have found out that our target file is located about 53 GB into the zip archive and is about 330 kB in size.
To retrieve this data, we can use a similar strategy as before, first generating an empty file (here called partial_target.zip) that has a size just smaller than the target file location:
$ fallocate -l 53679216658 partial_target.zip
Now we can again use the reget command while in an sftp session on the server to start downloading from the location of the target file in the archive:
sftp> reget data.zip partial_target.zip
We know that the file only occupies a few hundred kB in the zip archive, so we can manually stop the download after a short period.
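Rather than watching the transfer, we could compute an upper bound on how many bytes need to arrive before it is safe to stop. The compressed data at header_offset is preceded by a local file header of variable length, so the sketch below adds a generous slack (a small in-memory archive stands in for partial_end.zip here, so that the snippet is self-contained):

```python
import io
import zipfile

# A small in-memory archive stands in for `partial_end.zip`
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_handle:
    zip_handle.writestr("image.png", b"example payload")

# Generous allowance for the local file header that precedes the data
slack = 64 * 1024

with zipfile.ZipFile(buffer) as zip_handle:
    file_info = zip_handle.getinfo("image.png")
    stop_after = file_info.header_offset + file_info.compress_size + slack

# Stop the `reget` once the local file has grown to `stop_after` bytes
print(stop_after)
```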
To extract the file from the archive, we can build on the Python code that we wrote earlier.
The key aspect is that we point the zip file handler to our new partial_target.zip after the central directory has been initialised. That way, it knows the structure of the archive and has the relevant parts of the data needed to extract our target file.
import zipfile

target_file = "image.png"

with zipfile.ZipFile(file="partial_end.zip") as zip_handle:
    file_info = zip_handle.getinfo(name=target_file)
    with open("partial_target.zip", "rb") as partial_handle:
        # close the reference to `partial_end.zip`
        zip_handle.fp.close()
        # point instead towards the reference to `partial_target.zip`
        zip_handle.fp = partial_handle
        zip_handle.extract(member=file_info)
It seems a bit dubious to reassign the fp attribute in this way. However, the convention in Python is that attributes named without a leading underscore (unlike, say, _fp) are ‘public’, so maybe it is not so bad.
So we now have our target file, image.png, available on our local filesystem, having transferred only about 0.02% of the data that downloading the whole data.zip file would have required!
We can check the integrity of the extracted target file by comparing its CRC (Cyclic Redundancy Check) against the expected CRC stored in the zip file’s central directory:
import zipfile
import pathlib
import binascii

target_file = "image.png"

with zipfile.ZipFile(file="partial_end.zip") as zip_handle:
    file_info = zip_handle.getinfo(name=target_file)
    with open("partial_target.zip", "rb") as partial_handle:
        # close the reference to `partial_end.zip`
        zip_handle.fp.close()
        # point instead towards the reference to `partial_target.zip`
        zip_handle.fp = partial_handle
        zip_handle.extract(member=file_info)

local_CRC = binascii.crc32(pathlib.Path(target_file).read_bytes())

assert local_CRC == file_info.CRC
Abstracting away the details
The above process worked, but it required a lot of manual steps that are prone to error and would not scale gracefully to needing to access multiple files. Ideally, we can abstract away the details of how the process works and construct an interface that allows for user interaction at a more meaningful level.
In this case, it turns out that this also permits a much neater internal approach!
To perform such an abstraction, we will create a Python class, RemoteZipHandler, to represent the handling of the remote zip file. We will delegate the handling of the remote connection to the paramiko library and require that the user provide an instance of its SFTPClient.
The user will also need to provide the remote location of the relevant zip file.
It turns out that paramiko offers an interface with more functionality than the sftp console application. In particular, we can use its SFTPClient.open method to represent the remote zip file as a file-like object and pass this directly to the zipfile.ZipFile constructor, which can then randomly access locations in the remote zip file without any need to mess around with creating and resuming partial files! That is very cool and makes the process much more elegant.
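The fact that zipfile.ZipFile accepts any seekable file-like object, not just a filename, is what makes this possible; SFTPClient.open returns exactly such an object for the remote file. A quick stdlib-only demonstration, with an in-memory buffer standing in for the paramiko handle:

```python
import io
import zipfile

# An in-memory buffer stands in for the file-like object that
# `SFTPClient.open` would return for the remote archive
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_handle:
    zip_handle.writestr("image.png", b"example payload")

# `ZipFile` is happy to read from the file-like object directly
with zipfile.ZipFile(file=buffer) as zip_handle:
    names = zip_handle.namelist()
print(names)
```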
Here is my implementation of the RemoteZipHandler class. It defines the __enter__ and __exit__ “dunder” methods so that it can be used as a context manager and cleaned up appropriately, and it has methods that allow the user to obtain the names of all the files in the archive (get_file_list) and to extract one or more files (extract).
import zipfile


class RemoteZipHandler:

    def __init__(self, sftp_client, remote_zip_path):
        """
        `sftp_client` is an instance of `paramiko.sftp_client.SFTPClient`
        """
        self._remote_handle = sftp_client.open(filename=remote_zip_path)
        self._zip_handle = zipfile.ZipFile(file=self._remote_handle)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

    def close(self):
        self._zip_handle.close()
        self._remote_handle.close()

    def extract(self, paths):
        if isinstance(paths, str):
            paths = [paths]
        for path in paths:
            self._zip_handle.extract(member=path)

    def get_file_list(self):
        return [
            zip_info.filename
            for zip_info in self._zip_handle.filelist
        ]
Here is how it can be used to extract image.png from the data.zip file that we have been considering:
with RemoteZipHandler(
    sftp_client=sftp_client,
    remote_zip_path="data.zip",
) as remote_zip_handler:
    remote_zip_handler.extract("image.png")
Overall, this is a much simpler and more elegant approach. Importantly, the problem has been solved and the goal achieved—I can extract the relevant file from the archive without needing to download the entire zip file.