This week’s blog post is an example of using Legato to analyze a directory and a CSV file, and to update a database with the results. While this script is specialized for a personal website and may not have much practical application outside of similar sites, it showcases many of the concepts I want to cover in this blog and illustrates Legato’s ability to bring data together.
Friday, April 05. 2019
LDC #130: Synchronizing Meta Data
I wrote this script because I have a website that contains thousands of images. I also have a CSV file that contains meta information about these images. If I want to display that meta information on the website, I need to put it in a web-friendly format such as a MySQL database, which is not too challenging. But then I am faced with a new issue: the CSV file is updated regularly by an external application. How do I keep the meta data from the CSV in sync with the database on the website?
In the following example, I have specific image files that have a particular naming scheme as well as a CSV file that contains additional information about the images. I’ll explain my naming scheme and how the CSV is structured, and, while this is critical to the way my script works, you can adapt the script to suit any naming scheme that you choose to use.
So first, let’s talk about the image files. Each one is in a sub-directory that is the “name” of the image, and each image file is named in a specific format that contains a unique identifier as well as a descriptor for the image. Second, we have a CSV file that contains more information about the images, such as the source. We want to bring both of these together. For example, an image file may be named “JohnDoe\00011 Dance.png”, a picture I took of John Doe at his senior dance. The path indicates that the image is of “John Doe” participating in an activity, “Dance”. Inside the CSV file, there is a row with the ID “0001” that gives the source of the image, “School”, as well as other meta data. The remaining “1” in the name (the last digit) is a grouping number, so the website can group similar photos together.
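To make that concrete, here is a hypothetical row from the meta data CSV. The script below only relies on column 0 (the ID), column 2 (the source), and column 3 (the author); the second column here is just a placeholder for whatever else the external application writes, and the author value is invented for illustration:

    0001,John Doe,School,Jane Smith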
Now that you have a basic understanding of how the images are named and how the CSV meta data is structured, let’s discuss the steps we need to accomplish. In order to use all of the information we need to do a few things:
1. Connect to the Database.
2. Read any state information needed.
3. Read the CSV file of meta data.
4. Create a list of all the image files.
5. Iterate over the images and combine the meta data with the image data.
6. Process any new or changed images and update the database.
7. Save any state information for next time.
Since this is designed to be run only when needed, the script file will be a standalone function. Let’s start with step one. An ODBC connection can get this done for us. Remember that when using ODBC, the appropriate connector needs to be installed on your computer. For our example, that is the 32-bit MySQL connector.
void main() {
    string      files[];
    string      table[][];
    string      processed[][];
    string      parts[];
    string      fname, name, id, group, source, ver, author, activity, type, lastmod;
    string      line;
    string      fn, out;
    int         ix, mx, tx, tmx, px, pmx, npmx;
    handle      db, res;

    db = ODBCConnect("Driver=MySQL ODBC 8.0 Ansi Driver;Server=MyServer.com;Database=MyDB;Uid=MyUser;Pwd=******;");
    if (IsError(db)) {
        AddMessage("Couldn't connect to DB %s", GetLastErrorMessage());
        return;
    }
We start by declaring our variables. These are mostly self-explanatory: files is the list of files, table is the table of meta data, processed is the cache of processed information, and so on. We then use the ODBCConnect function with the MySQL ODBC driver and our credentials for the server.
    processed = CSVReadTable(AddPaths(GetScriptFolder(), "dbcache.csv"));
    pmx = ArrayGetAxisDepth(processed);
    npmx = pmx;
    SortTable(processed, SORT_ALPHA, 0);

    table = CSVReadTable(AddPaths(GetScriptFolder(), "List.csv"));
    tmx = ArrayGetAxisDepth(table);
    if (tmx == 0) {
        AddMessage("Couldn't Load CSV");
        return;
    }
Now that we have connected to the database, let’s move on to step two, reading any state information, and step three, reading the CSV file of meta data. The processed variable is a cache of items that have been processed in the past. We could read this information out of the database directly, but I opted not to because our images are keyed by ID and type, which would make the SQL table lookups slow. So instead we keep a CSV file of database IDs (not the same as the ID in the file name). Since this file may not exist on a first run, we don’t care if it fails to load. We then get the number of entries in the cache and store it in two different variables, pmx and npmx. This is because we will be adding to this table as the script processes entries; there is no point in searching through entries we’ve added on this run, so we record the current maximum for later. We also sort the table using the SortTable function. This is a crucial step, as it allows us to use a binary search for better performance.
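As an aside, this sort/search pairing is the pattern to remember. Here is a minimal sketch, using the same calls the script uses, with the example file name from above as the key:

    // sketch: the table must be sorted on the key column before searching
    SortTable(processed, SORT_ALPHA, 0);
    px = BinarySearchTable(processed, 0, "JohnDoe\\00011 Dance.png", SORT_ALPHA);
    if (px < 0) {
        // negative result: this file has never been processed
    }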
Next we load our meta data CSV file. Unlike our state data, it actually matters if this file fails to load, so we check for an error on this one. And with those few lines we move on to step four.
    files = EnumerateFiles(AddPaths(GetScriptFolder(), "*.png;*.gif;*.jpg"), FOLDER_LOAD_RECURSE | FOLDER_UTF_NAMES);
    mx = ArrayGetAxisDepth(files);
    if (mx == 0) {
        AddMessage("No Files");
        return;
    }
    out = AddPaths(GetScriptFolder(), "_Staging\\");
    RecycleFile(out);
    CreateFolder(out);
This step is very straightforward. We use EnumerateFiles to get a list of all the images in the folder and subfolders. If we don’t have any images, then there is nothing to do, so we can stop processing here. I also added some code here to help myself manage uploading the images to the website. We create a “_Staging” folder that the script will copy processed images into. This means we can just upload everything from this folder instead of copying all images or trying to manually figure out which ones to upload. Manually uploading hundreds of files, yikes!
Now we can move on to the more complicated part of our script. We will iterate over all the images and combine our data sources. If you remember from above, we are getting some of the data from the file name (name, id, group, and activity) and some of the data from the CSV file (source and author).
AddMessage("# Changes"); ProgressOpen("Checking Images...", 0); for (ix = 0; ix < mx; ix++) { ProgressUpdate(ix, mx); ProgressSetStatus("Checking %s", files[ix]); parts = ExplodeString(files[ix], "\\"); name = parts[0]; fname = name; ver = "Final"; name = ReplaceInString(name, " (PH)", ""); if (IsNotError(GetLastError())) { ver = "Placeholder"; } parts = ExplodeString(parts[1], " ", 1); id = ReplaceInStringRegex(parts[0], "^([0-9]+)[0-9]$", "$1"); group = ReplaceInStringRegex(parts[0], "^[0-9]+([0-9])$", "$1"); activity = ClipFileExtension(parts[1]); type = ReplaceInString(GetExtension(parts[1]), ".", ""); author = ""; source = ""; for (tx = 0; tx < tmx; tx++) { if (id == table[tx][0]) { source = table[tx][2]; author = table[tx][3]; break; } } lastmod = GetFileModifiedTimeString(AddPaths(GetScriptFolder(), files[ix]));
The script logs all of the SQL commands, so if the ODBC connector has issues you can always run the commands manually. We therefore start by adding an SQL comment to the log and creating a progress window. Then we enter the loop and update the progress window. We use the ExplodeString function to take apart the file name. Splitting the name on a backslash gives us an array of two entries, the first being the name of the image and the second being the ID, group, and activity of the image. If you remember the example above, we have now split “JohnDoe\00011 Dance.png” into “JohnDoe” and “00011 Dance.png”. We set name to the first of the exploded parts and keep an untouched copy in fname. We then do some special processing on name: if it contains the text “ (PH)”, we strip that marker out and mark the version of the image as a placeholder.
We then use ExplodeString again, this time splitting on a space and splitting the result of our first split. Again, this converts “00011 Dance.png” into “00011” and “Dance.png”, giving us the id and group as one string and the activity (plus extension) as another. Now we can use regular expressions to set the id and group variables. The id is all the digits in the first part of the split except the last, and the group is that last digit. All that remains is the activity variable, which is simply the remaining filename without the extension; a call to ClipFileExtension gives us that information. Lastly, we get the type of image, be it PNG, GIF, or JPEG, by looking at the extension.
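To see the whole decomposition in one place, here is a sketch tracing those calls against the example name, with the resulting values in the comments:

    parts = ExplodeString("00011 Dance.png", " ", 1);                 // ["00011", "Dance.png"]
    id = ReplaceInStringRegex(parts[0], "^([0-9]+)[0-9]$", "$1");     // "0001" (all digits but the last)
    group = ReplaceInStringRegex(parts[0], "^[0-9]+([0-9])$", "$1");  // "1" (the last digit)
    activity = ClipFileExtension(parts[1]);                           // "Dance"
    type = ReplaceInString(GetExtension(parts[1]), ".", "");          // "png"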
Now that we’ve obtained all the information we can from the image filename, it is time to look at the image’s meta data. To do this we iterate over the table of meta data and find the line that matches our image ID. If we find that line, we set the source and author to the appropriate columns. We can also get one last piece of information, the modified time of the image. For that we use the GetFileModifiedTimeString function. With all the data collected, step five is complete. Let’s move on to step six.
line = UTFToAnsi(FormatString("INSERT INTO `images` (`name`, `fname`, `id`, `group`, `source`, `author`, `version`, `activity`, `type`, `lastmod`) VALUES (\"%s\",\"%s\",%s,%s,\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\");", name, fname, id, group, source, author, ver, activity, type, lastmod)); px = BinarySearchTable(processed, 0, UTFToAnsi(files[ix]), SORT_ALPHA); if (px < 0) { px = pmx; } if (px != pmx) { if (line == processed[px][1]) { continue; } }
We create a line of SQL and store it in the line variable. We then use the BinarySearchTable function to check whether an entry for this file exists in our database cache. If an entry isn’t found, we set the result to the end of the list. If the result is not the end of the list, we check whether our would-be SQL query matches what we ran last time; if it does, nothing has changed for this entry and we continue. At this point px is either the index of an entry in our cache that needs to be updated, or equal to pmx, indicating this is a new entry. So, let’s process each case.
        if (px == pmx) {
            processed[npmx][0] = UTFToAnsi(files[ix]);
            processed[npmx][1] = UTFToAnsi(line);
            AddMessage(line);
            res = ODBCQuery(db, line);
            if (IsError(res)) {
                if (GetLastErrorMessage() != "No Data") {
                    AddMessage("Query Failed %s!", GetLastErrorMessage());
                    break;
                }
            }
            res = ODBCQuery(db, "SELECT LAST_INSERT_ID() as `id`;");
            if (IsError(res)) {
                AddMessage("Get Last ID Failed %s!", GetLastErrorMessage());
                break;
            }
            parts = ODBCFetchArray(res);
            processed[npmx][2] = parts[0];
            CloseHandle(res);
            npmx++;
        }
If this is a new entry, we update processed with the name of the file and the line of SQL. We then use ODBCQuery to run the line of SQL. Remember, if our query does not return a result, as with an insert command, we need to check not only for an error condition but also the error message using GetLastErrorMessage. If the query worked, we use ODBCQuery again to get the ID of the last insert, then use ODBCFetchArray to read the result and store it in our cache. If you have been paying attention, you will have noticed that our cache (processed) now has three columns: a file name, a SQL line, and the ID of the row in the database. We close the SQL result handle and increment npmx since we added a new entry. For an update, the code is as follows:
        else {
            processed[px][1] = UTFToAnsi(line);
            AddMessage(line);
            line = UTFToAnsi(FormatString("UPDATE `images` SET `source` = \"%s\", `version` = \"%s\", `author` = \"%s\", `lastmod` = \"%s\" WHERE `image_id` = %s LIMIT 1;", source, ver, author, lastmod, processed[px][2]));
            res = ODBCQuery(db, line);
            if (IsError(res)) {
                if (GetLastErrorMessage() != "No Data") {
                    AddMessage("Query Failed %s!", GetLastErrorMessage());
                    break;
                }
            }
        }
This code is simpler: we update processed with the new SQL line, then create a line of SQL that updates the database row using the ID stored in the processed table. Lastly, we run the line with ODBCQuery and check for errors. We don’t increment npmx since we are only updating our cache, not adding an entry. Finally, we wrap up the loop by copying this image into the staging area.
        CreateFolder(AddPaths(out, fname));
        CopyFile(AddPaths(GetScriptFolder(), files[ix]), AddPaths(out, files[ix]));
    }
We use CreateFolder to make a folder in the staging area using the image’s name (fname) and then copy the image into that folder using the CopyFile function. At this point there is only one step left from our initial design, step seven: we need to save the cached information. Luckily for us, Legato makes this super easy with a single call to CSVWriteTable.
    CSVWriteTable(processed, AddPaths(GetScriptFolder(), "dbcache.csv"));
}
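For reference, a saved cache row might look like the following. This is a hypothetical excerpt: the SQL is elided and the database ID in the third column is invented for illustration.

    JohnDoe\00011 Dance.png,"INSERT INTO `images` (...) VALUES (...);",42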
With that, our function is complete. Although this script does not have a lot of practical use outside of my own personal website (though it could be adapted if you, too, would like to create a database of image files with meta information), it does showcase the ease with which Legato allows you to bring many data formats together. Armed with these tools, you can use Legato in any of your upcoming projects.
The complete script:
void main() {
    string      files[];
    string      table[][];
    string      processed[][];
    string      parts[];
    string      fname, name, id, group, source, ver, author, activity, type, lastmod;
    string      line;
    string      fn, out;
    int         ix, mx, tx, tmx, px, pmx, npmx;
    handle      db, res;

    db = ODBCConnect("Driver=MySQL ODBC 8.0 Ansi Driver;Server=MyServer.com;Database=MyDB;Uid=MyUser;Pwd=******;");
    if (IsError(db)) {
        AddMessage("Couldn't connect to DB %s", GetLastErrorMessage());
        return;
    }

    processed = CSVReadTable(AddPaths(GetScriptFolder(), "dbcache.csv"));
    pmx = ArrayGetAxisDepth(processed);
    npmx = pmx;
    SortTable(processed, SORT_ALPHA, 0);

    table = CSVReadTable(AddPaths(GetScriptFolder(), "List.csv"));
    tmx = ArrayGetAxisDepth(table);
    if (tmx == 0) {
        AddMessage("Couldn't Load CSV");
        return;
    }

    files = EnumerateFiles(AddPaths(GetScriptFolder(), "*.png;*.gif;*.jpg"), FOLDER_LOAD_RECURSE | FOLDER_UTF_NAMES);
    mx = ArrayGetAxisDepth(files);
    if (mx == 0) {
        AddMessage("No Files");
        return;
    }
    out = AddPaths(GetScriptFolder(), "_Staging\\");
    RecycleFile(out);
    CreateFolder(out);

    AddMessage("# Changes");
    ProgressOpen("Checking Images...", 0);
    for (ix = 0; ix < mx; ix++) {
        ProgressUpdate(ix, mx);
        ProgressSetStatus("Checking %s", files[ix]);

        parts = ExplodeString(files[ix], "\\");
        name = parts[0];
        fname = name;
        ver = "Final";
        name = ReplaceInString(name, " (PH)", "");
        if (IsNotError(GetLastError())) {
            ver = "Placeholder";
        }

        parts = ExplodeString(parts[1], " ", 1);
        id = ReplaceInStringRegex(parts[0], "^([0-9]+)[0-9]$", "$1");
        group = ReplaceInStringRegex(parts[0], "^[0-9]+([0-9])$", "$1");
        activity = ClipFileExtension(parts[1]);
        type = ReplaceInString(GetExtension(parts[1]), ".", "");

        author = "";
        source = "";
        for (tx = 0; tx < tmx; tx++) {
            if (id == table[tx][0]) {
                source = table[tx][2];
                author = table[tx][3];
                break;
            }
        }
        lastmod = GetFileModifiedTimeString(AddPaths(GetScriptFolder(), files[ix]));

        line = UTFToAnsi(FormatString("INSERT INTO `images` (`name`, `fname`, `id`, `group`, `source`, `author`, `version`, `activity`, `type`, `lastmod`) VALUES (\"%s\",\"%s\",%s,%s,\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\");", name, fname, id, group, source, author, ver, activity, type, lastmod));
        px = BinarySearchTable(processed, 0, UTFToAnsi(files[ix]), SORT_ALPHA);
        if (px < 0) {
            px = pmx;
        }
        if (px != pmx) {
            if (line == processed[px][1]) {
                continue;
            }
        }
        if (px == pmx) {
            processed[npmx][0] = UTFToAnsi(files[ix]);
            processed[npmx][1] = UTFToAnsi(line);
            AddMessage(line);
            res = ODBCQuery(db, line);
            if (IsError(res)) {
                if (GetLastErrorMessage() != "No Data") {
                    AddMessage("Query Failed %s!", GetLastErrorMessage());
                    break;
                }
            }
            res = ODBCQuery(db, "SELECT LAST_INSERT_ID() as `id`;");
            if (IsError(res)) {
                AddMessage("Get Last ID Failed %s!", GetLastErrorMessage());
                break;
            }
            parts = ODBCFetchArray(res);
            processed[npmx][2] = parts[0];
            CloseHandle(res);
            npmx++;
        }
        else {
            AddMessage("OLD %s: ", processed[px][1]);
            AddMessage("NEW %s: ", UTFToAnsi(line));
            processed[px][1] = UTFToAnsi(line);
            line = UTFToAnsi(FormatString("UPDATE `images` SET `source` = \"%s\", `version` = \"%s\", `author` = \"%s\", `lastmod` = \"%s\" WHERE `image_id` = %s LIMIT 1;", source, ver, author, lastmod, processed[px][2]));
            res = ODBCQuery(db, line);
            if (IsError(res)) {
                if (GetLastErrorMessage() != "No Data") {
                    AddMessage("Query Failed %s!", GetLastErrorMessage());
                    break;
                }
            }
        }
        CreateFolder(AddPaths(out, fname));
        CopyFile(AddPaths(GetScriptFolder(), files[ix]), AddPaths(out, files[ix]));
    }
    CSVWriteTable(processed, AddPaths(GetScriptFolder(), "dbcache.csv"));
}
David Theis has been developing software for Windows operating systems for over fifteen years. He has a Bachelor of Sciences in Computer Science from the Rochester Institute of Technology and co-founded Novaworks in 2006. He is the Vice President of Development and is one of the primary developers of GoFiler, a financial reporting software package designed to create and file EDGAR XML, HTML, and XBRL documents to the U.S. Securities and Exchange Commission.
Additional Resources
Legato Script Developers LinkedIn Group
Primer: An Introduction to Legato