Stringing Your Problems Together
One of the best uses of Legato is file manipulation. If you follow the blog, you know we have published many articles on writing scripts that replace text in files, neatify code, add content to files, and the like. All of these operations have something in common: they read string data, edit it, and then write the string back to the file. With Legato’s various tools, writing scripts like these is a breeze. A common file editing script could look something like this:
void process_file(string name) {
    string line;
    string res;
    handle hMap;
    int i, max;

    hMap = OpenMappedTextFile(name);
    max = GetLineCount(hMap);
    for (i = 0; i < max; i++) {
        line = ReadLine(hMap, i);
        // Only process lines that contain the target text
        if (FindInString(line, "some text") >= 0) {
            line = process_line(line);
        }
        res += line + "\r\n";
    }
    CloseHandle(hMap);
    StringToFile(res, name);
}
This is pretty straightforward: iterate over every line in the file, process the line if needed, and add the line to the result. Then write over the file with the result. The script is simple and works, but there may be some performance issues. With smaller files this script performs flawlessly. As the size of the files increases, however, several issues appear. First, we are storing all of the file in memory. Depending on what the script is designed to do, this may not be an issue, but the potential problem is still worth mentioning. For example, if the file we opened was 100MB, at the end of this function the res variable could be 100MB or larger. Allocating 100MB of memory isn’t that big of a deal these days, as even a low-end computer usually has 1GB or more of RAM. As with many performance issues, you need to weigh development time against user impact. This sort of situation should have a low impact for users.
Second, we are slowly adding strings together. Every iteration of the loop we add a line to res. As stated above, this means res slowly gets to be the size of the file. However, there is an additional side effect to this approach: in order to perform certain operations on res, we need to create a copy of it. Consider this single line of code:
res += line + "\r\n";
or written out as:
res = res + line + "\r\n";
In order to change the value of res, Legato needs to determine the value of res added to line. To do so, Legato must create an intermediate variable to store the result of the addition operation and then set res to that result. This means that to perform an addition operation like this on a 100MB string, Legato creates another 100MB string to store the result and then moves the resulting string to the res variable. This is very common in many high-level programming languages. As interpreters and compilers mature, optimizations can be made to the language itself, but it’s better if we as developers keep in mind that those optimizations may not be there.
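To get a feel for how quickly this copying adds up, here is a rough back-of-the-envelope model. It assumes a file of n lines of about m bytes each (line ending included) and that every addition builds a full-size copy of the result, as described above. The symbols and the sample numbers are illustrative assumptions, not measurements, and the model counts only bytes copied by the concatenation, so it is not a prediction of run time.

```latex
% Iteration k produces a result of about k*m bytes.
\[
C_{\text{string}} \approx \sum_{k=1}^{n} k\,m = m\,\frac{n(n+1)}{2},
\qquad
C_{\text{once}} \approx n\,m
\]
% Example: n = 10,000 lines of m = 100 bytes (a 1MB file)
\[
C_{\text{string}} \approx 100 \cdot \frac{10{,}000 \cdot 10{,}001}{2}
  = 5{,}000{,}500{,}000 \text{ bytes},
\qquad
C_{\text{once}} \approx 1{,}000{,}000 \text{ bytes}
\]
```

Here C_once is the cost if each line were written into the result only once. Under these assumptions the repeated addition moves roughly 5GB of data to build a 1MB result, and the gap widens as the file grows because the first cost grows with the square of the line count.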
In Legato these operations are already partially optimized by a Working Pool of memory. When Legato performs operations like this, it uses a pool of memory specifically for operations. If that pool requires storing a 100MB string, the pool is increased to fit that string. Subsequent operations will now be using that new 100MB pool. This means the next operation that uses a 100MB string does not need to allocate memory since the pool is already large enough.
Now if we go back to that line with Legato’s Working Pool in mind there is still an issue. Since we are constantly increasing the size of res, the size of the Working Pool will also constantly increase. So in this case the language’s own optimization doesn’t help us much. Bummer! We can however optimize the code ourselves using Legato’s pool functions. Just like the Working Pool, we can create a pool of memory to store the value of res. Since we won’t be using operations like addition, the Working Pool will not need to increase either. Sounds like a win!
If you remember my previous blog post on neatifying code, I used a pool to store the resulting code. If you want to learn more about pools, I would suggest reading that post if you haven’t already.
Let’s update our code to use a pool instead.
void process_file(string name) {
    string line;
    handle hMap, hPool;
    int i, max;

    hPool = PoolCreate();
    hMap = OpenMappedTextFile(name);
    max = GetLineCount(hMap);
    for (i = 0; i < max; i++) {
        line = ReadLine(hMap, i);
        // Only process lines that contain the target text
        if (FindInString(line, "some text") >= 0) {
            line = process_line(line);
        }
        // Append to the pool instead of adding to a string variable
        PoolAppend(hPool, line + "\r\n");
    }
    CloseHandle(hMap);
    PoolWriteFile(hPool, name);
}
As you can see, the script no longer adds to a string variable but instead uses a String Pool object to store the result. We could go one step further. The PoolAppend function is being called with line and an addition operation. This means that a copy of line is added to the Working Pool. We can eliminate this excess copying by making two calls to PoolAppend, one for line and another for the line ending. This gain in performance will be negligible for this example since we are only adding two characters to line and line has already been used in other operations before this.
We have now made this function more efficient with little effort.
Working Pool
As stated above, Legato uses a Working Pool to store intermediate results and variables passed to and from user-created functions. When a user-created function is called, the size of the Working Pool is stored. After the function returns, the Working Pool is resized back to the starting size and then return information (if any) is added to the Working Pool. What this means for developers is that memory for function variables is “recovered” when the function returns. When the Working Pool shrinks, the memory is still available to Legato and therefore easily obtained. Armed with this information, you can make better choices about memory management.
Given our above example, the variable line will be at least the size of the largest line in the file we read. If a file contains a really long line, our Working Pool will increase to accommodate this size. Maybe we are using this function in a background script that monitors a folder for new documents. After a document with a very large line is processed, our script continues to run with an increased Working Pool that it may not be using. You don’t want a single user’s file to cause your script to retain tons of memory. For this, Legato offers the ReleaseWorkingPoolSpace function to shrink the cached Working Pool memory Legato has accumulated. This can be especially important for background scripts that run for long periods of time.
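As a minimal sketch of that idea, a background script could release the cached memory after each document it handles. The wrapper name process_and_release is hypothetical, and the sketch assumes the process_file function from earlier in this article is defined in the same script.

```legato
// Hypothetical wrapper for a long-running background script.
// Assumes process_file from earlier in this article is defined in the same script.
void process_and_release(string name) {
    process_file(name);          // may grow the Working Pool for a very long line
    ReleaseWorkingPoolSpace();   // release the cached Working Pool memory afterwards
}
```

Whether to release after every file or only after unusually large ones is a judgment call: releasing gives memory back sooner, while keeping the pool avoids growing it again for the next large file.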
Variable Scoping
As with most languages, variables last throughout the scope of their declaration. A global variable is retained in memory from the start of the script until the script ends. Likewise, a variable declared in a function exists only for the time that function is running. As function calls stack up, so do the variables for those functions. In our process_file example, the memory used by line exists throughout the call to process_line (if any), but line is destroyed once process_file returns. This is another way we as developers can optimize memory management. We can avoid doing things like using the FileToString function on a global variable unless we need the contents of the file throughout the duration of the script. If we need a portion of that file to be global, it is better to read the file in a function and then store the data we want globally. This way the size of our global data stays smaller.
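Here is a small sketch of that pattern using only functions that appear elsewhere in this article. The names first_line and load_first_line are hypothetical, the choice of the first line as the data worth keeping is just for illustration, and the sketch assumes a global is declared at the top of the script outside any function.

```legato
// Global: holds only the piece of the file the rest of the script needs.
string first_line;

// Hypothetical helper: the mapped file handle is local and is closed
// before the function returns, so only first_line outlives the call.
void load_first_line(string name) {
    handle hMap;

    hMap = OpenMappedTextFile(name);
    first_line = ReadLine(hMap, 0);
    CloseHandle(hMap);
}
```

Only the one line stays in memory for the life of the script, rather than the entire file.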
Example Script
Here is a script file using the example from above:
// Our line processing (doesn't even do anything)
string process_line(string line) {
    return line;
}
// Process file using strings
void process_file_string(string name) {
    string line;
    string res;
    handle hMap;
    int i, max;

    hMap = OpenMappedTextFile(name);
    max = GetLineCount(hMap);
    for (i = 0; i < max; i++) {
        line = ReadLine(hMap, i);
        if (FindInString(line, "some text") >= 0) {
            line = process_line(line);
        }
        // Each addition builds a new copy of the whole result
        res += line + "\r\n";
    }
    CloseHandle(hMap);
    StringToFile(res, name);
}
// Process file using pool
void process_file_pool(string name) {
    string line;
    handle hMap, hPool;
    int i, max;

    hPool = PoolCreate();
    hMap = OpenMappedTextFile(name);
    max = GetLineCount(hMap);
    for (i = 0; i < max; i++) {
        line = ReadLine(hMap, i);
        if (FindInString(line, "some text") >= 0) {
            line = process_line(line);
        }
        // Append to the pool instead of adding to a string variable
        PoolAppend(hPool, line + "\r\n");
    }
    CloseHandle(hMap);
    PoolWriteFile(hPool, name);
}
// Process file using pool 2
void process_file_pool2(string name) {
    string line;
    handle hMap, hPool;
    int i, max;

    hPool = PoolCreate();
    hMap = OpenMappedTextFile(name);
    max = GetLineCount(hMap);
    for (i = 0; i < max; i++) {
        line = ReadLine(hMap, i);
        if (FindInString(line, "some text") >= 0) {
            line = process_line(line);
        }
        // Two appends avoid the extra copy made by line + "\r\n"
        PoolAppend(hPool, line);
        PoolAppend(hPool, "\r\n");
    }
    CloseHandle(hMap);
    PoolWriteFile(hPool, name);
}
void main() {
    string fn1, fn2, fn3;
    int rc;

    // Download a sample file to work with
    fn1 = GetTempFile();
    rc = HTTPGetFile("https://en.wikipedia.org/wiki/Pi", fn1);
    if (IsError(rc)) {
        AddMessage("Failed to download file 0x%08x", rc);
        return;
    }
    AddMessage("File Downloaded to %s", fn1);

    // Make one copy per test so each function starts from the same input
    fn2 = GetTempFile();
    rc = CopyFile(fn1, fn2);
    if (IsError(rc)) {
        AddMessage("Failed to copy file 0x%08x", rc);
        DeleteFile(fn1);
        return;
    }
    AddMessage("File Copied to %s", fn2);
    fn3 = GetTempFile();
    rc = CopyFile(fn1, fn3);
    if (IsError(rc)) {
        AddMessage("Failed to copy file 0x%08x", rc);
        DeleteFile(fn1);
        DeleteFile(fn2);    // the second copy already exists at this point
        return;
    }
    AddMessage("File Copied to %s", fn3);
    AddMessage("File Size %a bytes", GetFileSize(fn1));
    AddMessage("");

    // Time each version, releasing cached Working Pool memory before each run
    ReleaseWorkingPoolSpace();
    ResetElapsedTime();
    process_file_string(fn1);
    AddMessage("Processed File Using Strings in %ams", GetElapsedTime());
    ReleaseWorkingPoolSpace();
    ResetElapsedTime();
    process_file_pool(fn2);
    AddMessage("Processed File Using Pool in %ams", GetElapsedTime());
    ReleaseWorkingPoolSpace();
    ResetElapsedTime();
    process_file_pool2(fn3);
    AddMessage("Processed File Using Pool 2 in %ams", GetElapsedTime());

    // Clean up the temporary files
    DeleteFile(fn1);
    DeleteFile(fn2);
    DeleteFile(fn3);
}
On my computer, the string version of process_file (process_file_string) takes about 2.30 seconds to run. The optimized version (process_file_pool) takes about 0.06 seconds to run. That is about 38 times faster! A little bit of coding on our part makes the user experience better. Another example of good memory management creating more efficient execution can be found in the neatify code blog post. I started by using a string variable to store the resulting code. During testing everything seemed great until I tried the script on a Wikipedia article about Pi. The script took almost 30 seconds to run. Ouch! After I switched the script so that it used a String Pool, the results spoke for themselves. The script was practically instantaneous, running in only 0.34 seconds. It was perfectly usable before, but now my end users don’t have to wait for the script to do processing.
As you can see, as developers we need not only to get our programs to work but also to make them responsive and efficient. Depending on the purpose of your script, performance may take a back seat. Sometimes you just want to get something done. When it comes to memory, if you understand how Legato works under the hood, you can design your script with optimizations already in place.
David Theis has been developing software for Windows operating systems for over fifteen years. He has a Bachelor of Sciences in Computer Science from the Rochester Institute of Technology and co-founded Novaworks in 2006. He is the Vice President of Development and is one of the primary developers of GoFiler, a financial reporting software package designed to create and file EDGAR XML, HTML, and XBRL documents to the U.S. Securities and Exchange Commission.
Additional Resources
Novaworks’ Legato Resources
Legato Script Developers LinkedIn Group
Primer: An Introduction to Legato