LDC #96: Data Compression - Part I

Friday, August 03. 2018

Various forms of data compression are employed all over the computing world. In many instances, you may need to read compressed data or create your own. This article is the first in a series about data compression and the tools available within Legato. We will start with an overview and then dive into the ubiquitous zip method by creating a little program to zip a project.

Introduction

Various types of data compression are used everywhere within digital processing. Most audio files, most images, video, communications and many file formats are compressed. Native Microsoft file formats, such as docx, xlsx, pptx, etc., are all compressed. Prior to the SEC eliminating their XBRL previewer, files had to be submitted as compressed ‘zip’ files. The list goes on.

But what is data compression? Basically, it’s a method of removing redundant information from a data stream and later rebuilding that data stream for consumption. Some compression methods are extremely data specific. For example, JPEG compression removes detail from images that the human eye cannot necessarily perceive. The degree of compression can be adjusted, with each increase lowering the quality of the image. Compression can be 10:1 or more, but it is what is known as ‘lossy’ compression. Similarly, MPEG for video (and MP3 for audio) employs lossy compression, with H.265 among the more recent methods. It builds on many of the earlier MPEG features: complete key frames are sent only occasionally, and the frames in between are built by sending the differences. Without this technology, it would be nearly impossible to support Netflix, Amazon Prime, YouTube and other video streaming services over the Internet. Furthermore, your 32 GB camera card would rapidly overflow.

Still, as we said, this is lossy compression, meaning the output is intended for a very specific purpose. That does not work when we need to restore compressed data exactly, as with files and folders sent via email. For that, we need lossless compression.

Lossless compression basically comes down to this: whatever the input data is prior to compression, the decompressed output must be exactly the same. Perhaps the simplest and earliest compression method developed is Run Length Encoding (RLE). It is very simple: suppose there are 200 bytes of the value zero. Rather than writing 200 zeros, the repeated information is detected and an escape code, size, and data (0) are written. Now only 3 bytes have been written. This is pretty cool and it provides a lot of compression for representing a simple image of a logo or a chart. Image formats such as PCX and BMP can employ this method (GIF, by contrast, uses the dictionary-based LZW scheme). However, it’s not so good for HTML, XML, databases, and other files. More universal lossless compression schemes were later developed that search for and replace repeated multiple-byte segments and also apply bit-level coding. This eventually led to the DEFLATE algorithm.
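To make the run-length idea concrete, here is a minimal sketch written in Python rather than Legato. The escape value 0x1B, the minimum run of four bytes, and the function names are all assumptions chosen for illustration; real RLE formats differ in the details.

```python
ESC = 0x1B  # hypothetical escape marker chosen for this sketch


def rle_encode(data: bytes) -> bytes:
    """Replace long runs with (ESC, count, value); short runs stay literal."""
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        # Measure the run, capped at 255 so the count fits in one byte.
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        # A literal ESC byte must always be escaped, even for a run of one.
        if run >= 4 or value == ESC:
            out += bytes([ESC, run, value])
        else:
            out += bytes([value]) * run
        i += run
    return bytes(out)


def rle_decode(packed: bytes) -> bytes:
    """Rebuild the original bytes from the encoded stream."""
    out = bytearray()
    i = 0
    while i < len(packed):
        if packed[i] == ESC:
            count, value = packed[i + 1], packed[i + 2]
            out += bytes([value]) * count
            i += 3
        else:
            out.append(packed[i])
            i += 1
    return bytes(out)


packed = rle_encode(bytes(200))   # 200 zero bytes
print(len(packed))                # 3 bytes: escape, count, value
```

Data with few repeated bytes, such as HTML, gains almost nothing from this scheme, which is why the more general methods were developed.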

For general compression, ‘zlib’ (developed by Jean-loup Gailly and Mark Adler, with the first version released in 1995) is arguably the most commonly used library. It is incorporated in many, many systems, including Windows, Linux, Mac, PlayStation, Wii, and Xbox. The most common bundling of files using DEFLATE is the zip format originated by Phil Katz in the program PKZIP. There are many other related formats, such as GZIP and 7Z, as well as TAR, which bundles files without compressing them.
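Python's standard library wraps the same zlib library, so a short sketch (in Python, not Legato) can show what lossless means in practice. The sample data is made up for the illustration.

```python
import zlib

# Highly repetitive markup, the kind of data DEFLATE handles well.
original = b"<tr><td>0</td><td>0</td></tr>\n" * 500

packed = zlib.compress(original, 9)   # 9 = best compression
restored = zlib.decompress(packed)

print(len(original), len(packed))
# Lossless: the round trip must reproduce the input exactly.
assert restored == original
```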

DEFLATE is supported in Legato via zlib. Many functions discussed in this series will be zlib based. The zip format supports a number of compression methods. However, Legato only supports the DEFLATE and STORE types, which are the most common.
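The difference between the two entry types can be seen with Python's standard zipfile module, used here only as a stand-in and not as the Legato Zip Object. The entry names and sample data are assumptions for the sketch.

```python
import io
import zipfile

text = b"<p>repeated markup</p>\n" * 400

# Build a zip in memory with one DEFLATE entry and one STORE entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("deflated.htm", text, compress_type=zipfile.ZIP_DEFLATED)
    archive.writestr("stored.htm", text, compress_type=zipfile.ZIP_STORED)

# Read the directory back and compare sizes inside the container.
with zipfile.ZipFile(buffer) as archive:
    info = {entry.filename: entry for entry in archive.infolist()}
    restored = archive.read("deflated.htm")

for name, entry in info.items():
    print(name, entry.file_size, entry.compress_size)
```

A stored entry occupies its full original size inside the container, while the deflated entry is much smaller for this kind of data. Both extract to identical bytes.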

You might be asking yourself: if I need to get the file small, why not compress and compress again and again? Of course, there is no such thing as a free lunch. If you grab a PNG, JPG, or MP3 and zip it, it will actually get a little bigger owing to the overhead of managing the container. Most multiple file compression schemes will detect such files and simply store the data as is. For higher compression ratios, programs like 7-zip consume multiple files of similar types to build more efficient compression tables at the cost of processing time.
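A quick experiment in Python (again using its zlib binding, not Legato) illustrates the point. Seeded random bytes stand in for an already compressed file such as a JPG; that substitution, the word list, and the seed are assumptions of the sketch.

```python
import random
import zlib

rng = random.Random(96)  # fixed seed so the run is repeatable

# Ordinary text compresses well the first time.
words = [b"zip", b"deflate", b"store", b"legato", b"project", b"file"]
text = b" ".join(rng.choice(words) for _ in range(20_000))
once = zlib.compress(text, 9)
twice = zlib.compress(once, 9)   # a second pass usually gains nothing
print(len(text), len(once), len(twice))

# Random bytes have no redundancy left to remove, much like a JPG or MP3.
noise = bytes(rng.getrandbits(8) for _ in range(50_000))
packed_noise = zlib.compress(noise, 9)
print(len(noise), len(packed_noise))   # output is slightly larger than input
```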

Multiple File Compression and the Zip Object

The Legato SDK contains a series of functions that allow for the processing of zip files. Many of the functions use the Legato Zip Object, which can be used to create or extract zip data (the object is not meant to actively edit a zip file).

To create a Zip Object:

handle = ZipCreate ( );

The function returns a handle that can be used to then add files:

int = ZipAddFile ( handle hZip, string source, [string path] );

Files can be added in succession by specifying the created handle and a source path or URI. As the files are added, the name component of the source is used as the entry name. An optional path parameter allows entries to be organized into a directory tree. (Note that the path parameter is available after version 1.1m.)

After all the files have been added, the Zip Object can be written by using the ZipWrite function:

int = ZipWrite ( handle hZip, string source );

Each file added must allow read access during the zip write process, since the files are compressed at that time.
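For readers more familiar with other environments, the create, add, and write sequence can be sketched with Python's zipfile module. This is an analogy, not the Legato API: the helper add_file, the file names, and the way the optional folder is joined to the name component are assumptions made for illustration.

```python
import os
import tempfile
import zipfile


def add_file(archive, source, path=""):
    """Hypothetical helper: entry name is the file's name component,
    optionally placed under a folder inside the zip."""
    name = os.path.basename(source)
    entry = f"{path.strip('/')}/{name}" if path else name
    # Unlike the Legato Zip Object, Python reads and compresses here,
    # not at a later write step.
    archive.write(source, arcname=entry)
    return entry


with tempfile.TemporaryDirectory() as work:
    source = os.path.join(work, "report.htm")
    with open(source, "w") as f:
        f.write("<p>hello</p>")

    target = os.path.join(work, "project.zip")
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        add_file(archive, source)            # stored at the zip root
        add_file(archive, source, "docs")    # stored under docs/

    with zipfile.ZipFile(target) as archive:
        entries = archive.namelist()

print(entries)
```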

Here is a practical application, a script to zip a project to allow it to be emailed:

    handle              hZip;
    string              s1, s2;
    string              list[];
    int                 ix, size;
    int                 rc;

                                                                // Set Up the Project
    s1 = ProjectGetName();
    if (s1 == "") {
      MessageBox('i',  "A project needs to be open to zip.");
      exit;
      }

    list = EnumerateEditFiles(TRUE);
    rc = FindInList(list, s1, FIND_NO_CASE);

    if ((ProjectGetModifiedStatus() != 0) || (rc >= 0)) {
      MessageBox('x',  "Save the project to continue.");
      exit;
      }

    s2 = ClipFileExtension(s1) + ".zip";
    rc = QueryOverwrite(s2);
    if (IsCancel(rc)) { return rc; }
    if (IsError(rc)) {
      MessageBox('x',  "Error 0x%08X accessing file.\r\r%s", rc, s2);
      exit;
      }

                                                                // Set Up the Zip Object
    hZip = ZipCreate();
                                                                // Add Project File
    ZipAddFile(hZip, s1);
                                                                // Add Project Files
    size = ProjectGetEntryCount();
    while (ix < size) {
      s1 = ProjectGetEntry(ix, TRUE);
      rc = FindInList(list, s1, FIND_NO_CASE);
      if (rc >= 0) { break; }
      ZipAddFile(hZip, s1);
      ix++;
      }
                                                                // Stopped Because of Modified
    if (ix != size) {
      MessageBox('x',  "Save modified files for the project to continue.\r\r'%s'",
                 GetFilename(s1));
      exit;
      }

                                                                // Write to Project Folder
    rc = ZipWrite(hZip, s2); 
    if (IsError(rc)) {
      MessageBox('x',  "Error 0x%08X creating zip file.\r\r%s", rc, s2);
      exit;
      }    

    MessageBox('i', "Project Zipped!\r\r%s", s2);

The above example could also be made into a hook or tool and added to the ribbon. The script is basically in three sections: project setup, adding files, and writing the zip file.

The project setup mostly checks for error conditions such as ‘is there a project?’, ‘is it modified?’ and ‘will zipping it to this file overwrite the last zip file?’ The modified status of edit windows is important since such modifications have not yet been written back to the edit files. To test that, we get a list of modified windows and simply scan the list for the project and later for each project file. The project itself is the first file added to the Zip Object.

The second part is getting a count of project entries and enumerating them directly into the Zip Object. While adding them, they, too, are checked for unsaved modifications.

Finally, the Zip Object is written. It is important to check for errors on this function because it performs the heavy lifting with the possibility of file errors during the compression process and the eventual writing of the zip file.

Conclusion

There are a lot of applications for creating zips. The above example could be used to archive projects. It could be further modified to enumerate the folder contents, capturing all the data including project entries. Of course, enumerating the project entries would also bring in files that were externally referenced.

In my next blog post on this subject, we will examine reading zip files.

Scott Theis is the President of Novaworks and the principal developer of the Legato scripting language. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages.