LDC #123: Comparing Two Zip Files

Friday, February 15, 2019


As developers, we often download tools and specification updates as zip files. Figuring out which files have changed between versions can be a pain. For this blog, we are going to create a simple script that compares the contents of two zip files. Instead of checking every file by hand, you can run this script and have it print a list of the files that need to be reviewed.

In order to do the comparison we first need to extract both containers. As a note, we could adapt this script to work with any container format, but zip support is built into Legato and so is a good choice. If we want the comparison to be complete, we will then need to extract any zip files that were inside of the zip files we’re comparing. The SEC’s technical specifications contain many nested zip files. After everything is completely extracted we will need to build a list of files from both zips and then compare them. Any files that have the same relative path in both zips will then need to be hashed to detect whether their contents are the same. Lastly, we can print a report of all the files.

It is important to note that most container formats actually do contain date and time information for each file so we could check the modified times to run our comparison and look for changes. However, these values can be really unreliable. For example, a zip only has time accurate to two seconds. Instead of gambling with times, we can just analyze the files themselves. If you want to learn more about hashing files, you can check out my blog on hashing here. For simplicity this blog is going to use the MD5 hashing algorithm, but we could easily adapt the script to use the SHA256 algorithm from the previous post.
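The two-second limit is easy to see outside of Legato. The following is a minimal sketch in Python (not Legato), assuming only the standard zipfile module; the entry name and the helper roundtrip_seconds are made up for illustration. It stores one entry with a chosen seconds value and reads the timestamp back.

```python
import io
import zipfile

def roundtrip_seconds(seconds: int) -> int:
    """Store one entry with the given seconds value and read the stored value back."""
    buf = io.BytesIO()
    # Hypothetical entry name and date; only the seconds field matters here.
    info = zipfile.ZipInfo("example.txt", date_time=(2019, 2, 15, 10, 30, seconds))
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(info, "hello")
    with zipfile.ZipFile(buf, "r") as zf:
        return zf.getinfo("example.txt").date_time[5]

if __name__ == "__main__":
    for s in (30, 31):
        print(s, "->", roundtrip_seconds(s))
```

An odd seconds value comes back as the even value below it, so two files saved a second apart can carry the same timestamp. That is one reason the script compares contents instead.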

So breaking down the parts, we will need a function that extracts our files, a function to hash a file, and a function to compare directories. We will also create a main function to run our compare directories function so we can test that things are working. We likely could have made this a single function, but splitting it into multiple functions makes it easier to make edits down the road. Let’s start with our main function.


void main() {

    string      s1, s2;
    int         rc;
    
    s1 = AddPaths(GetTempFileFolder(), "zip_base");
    s2 = AddPaths(GetTempFileFolder(), "zip_cmp");
    
    RecycleFile(s1);
    RecycleFile(s2);
    
    rc = ZipExtractToFolder(OLD_ZIP, s1);
    if (IsError(rc)) {
      MessageBox('X', "Could not extract base zip file (0x%08x)", rc);
      return;
      }
    rc = ZipExtractToFolder(NEW_ZIP, s2);
    if (IsError(rc)) {
      MessageBox('X', "Could not extract new zip file (0x%08x)", rc);
      return;
      }

    compare_directories(s1, s2);

    RecycleFile(s1);
    RecycleFile(s2);
    }

We start by creating two folders in the Windows temporary files location. Because we may want the script to work with other container formats later, it is easier to extract the zip files in full rather than use Legato’s functions to access single files inside a zip file. Working in the temporary area also ensures we don’t overwrite any important files on the user’s computer. We first delete both directories using the RecycleFile function. We don’t check the return value because we don’t care if these directories didn’t exist. This gives us a clean slate every time the script is run. Next we extract the zip files directly into those folders using the ZipExtractToFolder function.

There are two things to note here. First, in my overall design above, I suggested creating a function to extract the containers, but we aren’t going to do that just yet. Second, we are using the defines OLD_ZIP and NEW_ZIP, which appear in the complete script below, to identify our zip files. The defines are a shortcut for the blog post since adding a UI is not the focus here. Consider adding one a challenge if you wish to use the script for yourself.

We then call our compare_directories function to compare the extracted files. Lastly, we delete the folders to save disk space.
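For readers who want to try the same setup outside of Legato, here is a rough Python sketch of the clean-slate step. It is not the Legato code: the helper name fresh_extract is made up, it takes a local zip path rather than a URL, and it deletes the old folder permanently instead of sending it to the recycle bin.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def fresh_extract(zip_path, folder_name: str) -> Path:
    """Extract zip_path into a clean, named folder under the system temp directory."""
    dest = Path(tempfile.gettempdir()) / folder_name
    # Ignore errors on purpose: we don't care if the folder was never there.
    shutil.rmtree(dest, ignore_errors=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return dest
```

Calling fresh_extract twice with the same folder name never leaves files from the first run behind, which is the point of deleting before extracting.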

That is the main function. As you can see, all it does is set up our comparison. Before we move to the comparison function, let’s create the extract function we talked about above. This function is going to take a zip file name and extract the contents of the zip into a folder with the same name as the zip file, minus the extension. If a folder with that name already exists, we assume the zip file has already been extracted. This isn’t an entirely safe assumption, since a zip file could contain both a folder and a zip file with the same name. But, for our purposes, this is enough of a safety net.


void extract_zip(string fn) {

    string      dest;
    int         rc;
    
    dest = ClipFileExtension(fn);
    if (DoesPathExist(dest)) {
      return;
      }
    AddMessage("    Extracting %s...", fn);
    rc = ZipExtractToFolder(fn, dest);
    if (IsError(rc)) {
      AddMessage("      Could not extract %s (0x%08x)", fn, rc);
      return;
      }
    }

We start by creating the destination folder name using the ClipFileExtension function. We then check to see if the path exists using the DoesPathExist function. If that path already exists, we simply exit, as discussed above. Otherwise we log that we are extracting a file and then extract the contents using the ZipExtractToFolder function.
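The same skip-if-already-extracted logic can be sketched in Python for comparison. This is an illustration only, not a translation of the Legato API. The boolean return value is an addition of this sketch so that a caller can tell whether anything happened.

```python
import zipfile
from pathlib import Path

def extract_zip(zip_path: Path) -> bool:
    """Extract zip_path into a sibling folder named after it, minus the extension.

    Returns True if files were extracted, False if skipped or failed.
    """
    dest = zip_path.with_suffix("")        # "spec.zip" -> "spec"
    if dest.exists():
        return False                       # assume it was extracted earlier
    try:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest)
    except (zipfile.BadZipFile, OSError) as exc:
        print(f"      Could not extract {zip_path} ({exc})")
        return False
    return True
```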

Now that we have extract_zip out of the way we can start to talk about the function that does most of the work, compare_directories.


void compare_directories(string base, string cmp) {

    string      results[][];
    string      files_b[];
    string      files_c[];
    string      zips[];
    string      name;
    int         bx,
                cx,
                rx,
                cmp_count,
                results_count,
                base_count;

    AddMessage("Extracting Zip Files...");

    // Process Base Zips
    base_count = 0;
    zips = EnumerateFiles(AddPaths(base, "*.zip"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
    while (ArrayGetAxisDepth(zips) != base_count) {
      base_count = ArrayGetAxisDepth(zips);
      for (bx = 0; bx < base_count; bx++) {
        extract_zip(AddPaths(base, zips[bx]));
        }
      zips = EnumerateFiles(AddPaths(base, "*.zip"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
      }

    // Process CMP Zips
    cmp_count = 0;
    zips = EnumerateFiles(AddPaths(cmp, "*.zip"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
    while (ArrayGetAxisDepth(zips) != cmp_count) {
      cmp_count = ArrayGetAxisDepth(zips);
      for (cx = 0; cx < cmp_count; cx++) {
        extract_zip(AddPaths(cmp, zips[cx]));
        }
      zips = EnumerateFiles(AddPaths(cmp, "*.zip"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
      }

We need to extract all the zip files that might be inside of the zip files we’re comparing. In order to do this, we get a list of all the zip files using the EnumerateFiles function and then extract them using extract_zip. Afterwards, we check whether the number of zip files in the directory tree has increased. If it has, we need to extract more. If it has not, we have extracted all the zip files. This is why the extract_zip function doesn’t extract the zip file if there is already a directory with the correct name: otherwise every pass would extract every zip again. Alternatively, we could keep an array of zip files that have been processed, but this seemed like a simple solution.
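To make the "repeat until the count stops growing" idea concrete, here is a small Python sketch of the same loop. It is an illustration under assumptions, not the Legato code: extract_nested is a made-up name and the folder-exists check is inlined rather than calling a separate function.

```python
import zipfile
from pathlib import Path

def find_zips(root: Path):
    """List every *.zip file anywhere under root."""
    return sorted(p for p in root.rglob("*.zip") if p.is_file())

def extract_nested(root: Path) -> int:
    """Extract zips under root until a pass finds no new zip files.

    Returns the final number of zip files in the tree.
    """
    seen = 0
    zips = find_zips(root)
    while len(zips) != seen:
        seen = len(zips)
        for z in zips:
            dest = z.with_suffix("")
            if dest.exists():
                continue                   # extracted on an earlier pass
            with zipfile.ZipFile(z) as zf:
                zf.extractall(dest)
        zips = find_zips(root)             # did extraction reveal more zips?
    return seen
```

With one zip that contains one inner zip, the loop runs two passes: the first pass reveals the inner zip, and the second extracts it and finds nothing new.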


    // Build File Lists
    AddMessage("Processing...");
    files_b = EnumerateFiles(AddPaths(base, "*.*"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
    base_count = ArrayGetAxisDepth(files_b);
    files_c = EnumerateFiles(AddPaths(cmp, "*.*"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
    cmp_count = ArrayGetAxisDepth(files_c);
    
    // Add to results
    results_count = 0;
    for (bx = 0; bx < base_count; bx++) {
      results[results_count]["partial"] = files_b[bx];
      results[results_count]["source"] = "old";
      results_count++;
      }
    for (cx = 0; cx < cmp_count; cx++) {
      for (rx = 0; rx < results_count; rx++) {
        if (results[rx]["partial"] == files_c[cx]) {
          break;
          }
        }
      if (rx == results_count) {
        results[results_count]["partial"] = files_c[cx];
        results[results_count]["source"] = "new";
        results_count++;
        }
      else {
        results[rx]["source"] = "both";
        }
      }

Now that we have all the files extracted, we can build a list of files. We get an array of file names from both folders. We start by adding all the files in the base directory to our results list, marking the source as “old”. Then we look at the files from the compare directory. For each one we check whether a file with that name is already in the results list. Keep in mind that the EnumerateFiles function gave us partial paths, so each name is unique within its own list. If we find a match, the file is in both directories and we mark its source as “both”. If it was not in the results list, it is a new file and we add it with the source “new”.
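The classification step boils down to a merge of two name lists. As a sketch in Python (not Legato, and with a made-up function name), a set lookup can stand in for the inner loop that scans the results list:

```python
def classify(base_files, cmp_files):
    """Map each relative path to "old", "new", or "both"."""
    base = set(base_files)
    results = {name: "old" for name in base_files}
    for name in cmp_files:
        results[name] = "both" if name in base else "new"
    return results

if __name__ == "__main__":
    print(classify(["a.txt", "b.txt"], ["b.txt", "c.txt"]))
```

Here a.txt ends up as "old", b.txt as "both", and c.txt as "new". For the handful of files in a typical specification zip, the nested loop in the script is perfectly adequate. The lookup only matters when the lists get large.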


    // Build Hashes
    for (rx = 0; rx < results_count; rx++) {
      if (results[rx]["source"] != "both") {
        continue;
        }
      AddMessage("    Hashing Files %s", results[rx]["partial"]);
      results[rx]["hash_base"] = hash_file(AddPaths(base, results[rx]["partial"]));
      results[rx]["hash_cmp"] = hash_file(AddPaths(cmp, results[rx]["partial"]));
      if (results[rx]["hash_base"] != results[rx]["hash_cmp"]) {
        results[rx]["source"] = "diff";
        }
      }

At this point, we have an array of files with a name and a source. If the source is “both” we need to analyze the file to see if it has changed between the zip files. We loop over the files, and if the source is not “both”, we go to the next file. We then hash files using our hash_file function. We still haven’t defined this function but all we need to know for now is that it returns a string representing a hash of the file’s content. We then compare the hashes. If they are different, we change the source to “diff”. This allows us to sort the table later using the source column.


    SortTable(results, 0, "source", "partial");

    // Print Results
    AddMessage("Results:");
    for (rx = 0; rx < results_count; rx++) {
      name = "";
      if (results[rx]["source"] == "old") {
        name = "REMOVED";
        }
      if (results[rx]["source"] == "new") {
        name = "ADDED";
        }
      if (results[rx]["source"] == "both") {
        name = "SAME";
        }
      if (results[rx]["source"] == "diff") {
        name = "CHANGED";
        }
      AddMessage("    %-7s: %s", name, results[rx]["partial"]);
      }
    }

Now our results array contains a name, source, and possibly hashes for each source file. We can now sort this list using the SortTable function based on the source key as well as the partial key. This groups all the files based on whether they are only in one zip, the same in both, or different, and when the source is the same, it then sorts by name. Now that our list is organized, we can print it out. If the source is “old”, the file is only in the base zip, so we mark it as “REMOVED”. If the source is “new”, the file is only in the new zip, so we mark it as “ADDED”. If the source is still “both”, the file was in both zips and the hashes matched, so we mark it as “SAME”. Lastly, if the source is “diff”, the file was in both zips but the hashes did not match, so we mark the file as “CHANGED”. We then add the label and the file name to our log.
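The sort-then-label step can be sketched in Python as follows. This is only an illustration: it assumes the sort is ascending on the source text and then on the path, and the names LABELS and report_lines are made up for the example.

```python
# Source value -> label printed in the report.
LABELS = {"old": "REMOVED", "new": "ADDED", "both": "SAME", "diff": "CHANGED"}

def report_lines(results):
    """results maps relative path -> source. Returns the formatted report lines."""
    # Sort by source first, then by path, so files with the same status group together.
    rows = sorted(results.items(), key=lambda item: (item[1], item[0]))
    return [f"    {LABELS.get(source, ''):<7}: {path}" for path, source in rows]

if __name__ == "__main__":
    sample = {"b.txt": "old", "a.txt": "diff", "c.txt": "both", "d.txt": "new"}
    print("\n".join(report_lines(sample)))
```

Because the sort is on the raw source strings, the groups come out in alphabetical order of those strings (“both”, “diff”, “new”, “old”) rather than in order of importance. If you would rather see changed files first, sort on a numeric rank instead.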

Now that the most complex function is done, we can talk about the hash_file function.


string hash_file(string fn) {
    handle file;
    string res;
    
    file = OpenFile(fn, FO_READ | FO_SHARE_READ);
    if (IsError(file)) {
      AddMessage("      Couldn't open file. %s", GetLastErrorMessage());
      return "";
      }

    res = MD5CreateDigest(file);
    
    CloseHandle(file);
    return res;
    }

This function is pretty straightforward. We open the file we want to hash. If we can’t open the file, we log an error and return an empty string for the hash. If we can open the file, we hash the file using the MD5CreateDigest function. We could easily replace this call with any other hashing algorithm. We then close the file using the CloseHandle function and return the string. Nice and simple. One caveat: if neither copy of a file can be opened, both hashes are empty strings, so the comparison will report the file as “SAME”.
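To show how little changes when the algorithm is swapped, here is a Python sketch of the same function with the algorithm as a parameter. It is not Legato code. It uses Python’s standard hashlib module, and the chunk size is an arbitrary choice for the example.

```python
import hashlib

def hash_file(path, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file's contents, or "" if it cannot be read."""
    digest = hashlib.new(algorithm)
    try:
        with open(path, "rb") as fh:
            # Read in chunks so large files are never held in memory at once.
            for chunk in iter(lambda: fh.read(65536), b""):
                digest.update(chunk)
    except OSError as exc:
        print(f"      Couldn't open file. {exc}")
        return ""
    return digest.hexdigest()
```

Passing "md5" as the algorithm gives the behavior used in this post, while the default gives SHA-256. The rest of the comparison does not care which one is used, as long as both copies of a file are hashed the same way.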

Here is a complete copy of the script with the defines already filled out. If you run this script it will compare the Form ATS-N technical specification 1.0 zip file on the SEC’s website with the 1.1 draft specification. As you can see, with Legato a little bit of development time can save you hours of comparing files.




#define OLD_ZIP "https://www.sec.gov/info/edgar/specifications/form-ats-n-xml-1.0.zip"
#define NEW_ZIP "https://www.sec.gov/info/edgar/specifications/form-ats-n-xml-1.1_d.zip"

void    main                            ();
void    extract_zip                     (string fn);
void    compare_directories             (string base, string cmp);
string  hash_file                       (string fn);

void main() {

    string      s1, s2;
    int         rc;
    
    s1 = AddPaths(GetTempFileFolder(), "zip_base");
    s2 = AddPaths(GetTempFileFolder(), "zip_cmp");
    
    RecycleFile(s1);
    RecycleFile(s2);
    
    rc = ZipExtractToFolder(OLD_ZIP, s1);
    if (IsError(rc)) {
      MessageBox('X', "Could not extract base zip file (0x%08x)", rc);
      return;
      }
    rc = ZipExtractToFolder(NEW_ZIP, s2);
    if (IsError(rc)) {
      MessageBox('X', "Could not extract new zip file (0x%08x)", rc);
      return;
      }

    compare_directories(s1, s2);

    RecycleFile(s1);
    RecycleFile(s2);
    }

void extract_zip(string fn) {

    string      dest;
    int         rc;
    
    dest = ClipFileExtension(fn);
    if (DoesPathExist(dest)) {
      return;
      }
    AddMessage("    Extracting %s...", fn);
    rc = ZipExtractToFolder(fn, dest);
    if (IsError(rc)) {
      AddMessage("      Could not extract %s (0x%08x)", fn, rc);
      return;
      }
    }

void compare_directories(string base, string cmp) {

    string      results[][];
    string      files_b[];
    string      files_c[];
    string      zips[];
    string      name;
    int         bx,
                cx,
                rx,
                cmp_count,
                results_count,
                base_count;

    AddMessage("Extracting Zip Files...");

    // Process Base Zips
    base_count = 0;
    zips = EnumerateFiles(AddPaths(base, "*.zip"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
    while (ArrayGetAxisDepth(zips) != base_count) {
      base_count = ArrayGetAxisDepth(zips);
      for (bx = 0; bx < base_count; bx++) {
        extract_zip(AddPaths(base, zips[bx]));
        }
      zips = EnumerateFiles(AddPaths(base, "*.zip"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
      }

    // Process CMP Zips
    cmp_count = 0;
    zips = EnumerateFiles(AddPaths(cmp, "*.zip"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
    while (ArrayGetAxisDepth(zips) != cmp_count) {
      cmp_count = ArrayGetAxisDepth(zips);
      for (cx = 0; cx < cmp_count; cx++) {
        extract_zip(AddPaths(cmp, zips[cx]));
        }
      zips = EnumerateFiles(AddPaths(cmp, "*.zip"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
      }

    // Build File Lists
    AddMessage("Processing...");
    files_b = EnumerateFiles(AddPaths(base, "*.*"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
    base_count = ArrayGetAxisDepth(files_b);
    files_c = EnumerateFiles(AddPaths(cmp, "*.*"), FOLDER_LOAD_RECURSE | FOLDER_LOAD_NO_FOLDER_NAV);
    cmp_count = ArrayGetAxisDepth(files_c);
    
    // Add to results
    results_count = 0;
    for (bx = 0; bx < base_count; bx++) {
      results[results_count]["partial"] = files_b[bx];
      results[results_count]["source"] = "old";
      results_count++;
      }
    for (cx = 0; cx < cmp_count; cx++) {
      for (rx = 0; rx < results_count; rx++) {
        if (results[rx]["partial"] == files_c[cx]) {
          break;
          }
        }
      if (rx == results_count) {
        results[results_count]["partial"] = files_c[cx];
        results[results_count]["source"] = "new";
        results_count++;
        }
      else {
        results[rx]["source"] = "both";
        }
      }

    // Build Hashes
    for (rx = 0; rx < results_count; rx++) {
      if (results[rx]["source"] != "both") {
        continue;
        }
      AddMessage("    Hashing Files %s", results[rx]["partial"]);
      results[rx]["hash_base"] = hash_file(AddPaths(base, results[rx]["partial"]));
      results[rx]["hash_cmp"] = hash_file(AddPaths(cmp, results[rx]["partial"]));
      if (results[rx]["hash_base"] != results[rx]["hash_cmp"]) {
        results[rx]["source"] = "diff";
        }
      }
      
    SortTable(results, 0, "source", "partial");

    // Print Results
    AddMessage("Results:");
    for (rx = 0; rx < results_count; rx++) {
      name = "";
      if (results[rx]["source"] == "old") {
        name = "REMOVED";
        }
      if (results[rx]["source"] == "new") {
        name = "ADDED";
        }
      if (results[rx]["source"] == "both") {
        name = "SAME";
        }
      if (results[rx]["source"] == "diff") {
        name = "CHANGED";
        }
      AddMessage("    %-7s: %s", name, results[rx]["partial"]);
      }
    }
    
string hash_file(string fn) {
    handle file;
    string res;
    
    file = OpenFile(fn, FO_READ | FO_SHARE_READ);
    if (IsError(file)) {
      AddMessage("      Couldn't open file. %s", GetLastErrorMessage());
      return "";
      }

    res = MD5CreateDigest(file);
    
    CloseHandle(file);
    return res;
    }

David Theis has been developing software for Windows operating systems for over fifteen years. He has a Bachelor of Sciences in Computer Science from the Rochester Institute of Technology and co-founded Novaworks in 2006. He is the Vice President of Development and is one of the primary developers of GoFiler, a financial reporting software package designed to create and file EDGAR XML, HTML, and XBRL documents to the U.S. Securities and Exchange Commission.