This week’s blog post is going to focus on Legato’s ability to sort and search tables and lists. Many times when developing, you need to collect up a list or table of data and perform operations or analysis on that data. When analysis is involved it usually means finding specific entries in the list or table. As the data set grows this search can start to get very expensive. I’ve created a script to show how searching lists and tables can impact your script performance.
Friday, September 13. 2019
LDC #152: Sorting and Searching Tables
For an example I decided to write a script that creates a histogram of words used in a text document. This is a simplistic example as all it does with the histogram is track the most common word in the text (spoiler alert: it’s “the”). In reality analysis usually involves something a bit more complex but since this is about searching the analysis part is not important. Let’s dive right into the script:
string text; #define WCOL 0 #define CCOL 1 void run_linear (); void run_binary (); void run_mix (); void main() { text = HTTPGetString("http://www.textfiles.com/etext/FICTION/2city10.txt"); if (text == "") { AddMessage("Failed to load text 0x%08x", GetLastError()); return; } AddMessage("Running with Linear Search..."); run_linear(); AddMessage("Running with Binary Search..."); run_binary(); AddMessage("Running with Binary with partial linear..."); run_mix(); }
This portion of the script is some set up. We have a global text variable that is the text we are going to parse. I made this global so we only download the text file once. Since we are working with a table I made defines for the column indices we will use in the script. Next are the function definitions. Since this blog is about sorting we have three functions that do exactly the same thing. The run_linear function will build a histogram by linearly searching the table for the current word. The run_binary function will binary sort the table and then search it for the current word. Lastly, the run_mix function will perform mix of both operations. The reason for this will become clear after running the script.
The main function loads our text source. I picked a text file version of “A Tale of Two Cities”, by Charles Dickens, since it will have plenty of words. The main function then runs each of our processing functions to highlight differences. Let’s take a look at the run_linear function:
void run_linear() { string words[][]; string s1; handle wp; int wx, last, mc, cc, mix; qword t_tot, t_search, t_sort, t_start1, t_start2; wp = WordParseCreate(WP_GENERAL, text); if (IsError(wp)) { AddMessage("Failed to load parser 0x%08x", GetLastError()); return; } t_start1 = GetTickCount(); last = 0; t_search = 0; t_sort = 0; mc = 0; s1 = WordParseGetWord(wp); while (s1 != "") { // Remove Punctuation s1 = ConvertNoPunctuation(s1); // Linearly Find Word t_start2 = GetTickCount(); for (wx = 0; wx < last; wx++) { if (words[wx][WCOL] == s1) { break; } } t_search += GetTickCount() - t_start2; if (wx == last) { words[wx][WCOL] = s1; words[wx][CCOL] = "1"; last++; } else { cc = DecimalToInteger(words[wx][CCOL]) + 1; if (cc > mc) { mc = cc; mix = wx; } words[wx][CCOL] = FormatString("%d", cc); } // Next Word s1 = WordParseGetWord(wp); } t_tot = GetTickCount() - t_start1; AddMessage(" Unique Words: %d", last); AddMessage(" Most Common Word: %s (%d times)", words[mix][WCOL], words[mix][CCOL]); AddMessage(" Sort Time: %dms", t_sort); AddMessage(" Search Time: %dms", t_search); AddMessage(" Total Time: %dms", t_tot); }
For these functions the code in black is pretty much the same between the functions. There are small changes to deal with the tracking the most used word but the changes are negligible. The blue code contains the only changes we care about. We start off by loading the string into the word parser. Since we want words we will leave it in generic mode. We will then set up a few variables and fall into the parse loop. We are using GetTickCount to keep track of how long operations are taking. Since this function returns the number of milliseconds since the computer was started we can use the difference between two different calls to time things. For this example we will time the overall histogram construction and then also the the time it took to search for words while building the histogram.
Now, we take the word and remove punctuation using ConvertNoPunctuation since commas and periods will mess up the analysis. Then we find the word in our words table. For this function, we use a simple for loop to iterate over each word in the table to find a match. If it’s found, we increment the count for that word; if not, we add a new entry. We also check if this word’s count exceeds the current maximum count we’ve found. Finally, we get the next word from the text and the loop repeats.
Now that the table is built we can print the results which in our case is how long it took to build the table. Before moving on to the next functions, I’m going to show the results of this function for comparison purposes. Your results for the times may differ since computer speeds vary but it should look something like this (assuming you comment out the calls to the other functions):
Running with Linear Search... Unique Words: 11649 Most Common Word: the (7383 times) Sort Time: 0ms Search Time: 105687ms Total Time: 106297ms
At a quick glance you can see that almost all of the script run time is spend searching through the list of words. This isn’t too surprising since it is pretty much the only thing the script does. Also, as you can see, on my computer this took about 1 minute and 45 seconds to run. From an optimization standpoint this performance isn’t very bad considering how many words are in “A Tale of Two Cities.” But let’s see if we can do better with Legato’s BinarySearchTable and SortTable functions:
void run_binary() { string words[][]; string s1, mcw; handle wp; int wx, last, mc, cc; qword t_tot, t_search, t_sort, t_start1, t_start2; wp = WordParseCreate(WP_GENERAL, text); if (IsError(wp)) { AddMessage("Failed to load parser 0x%08x", GetLastError()); return; } t_start1 = GetTickCount(); last = 0; t_search = 0; t_sort = 0; mc = 0; s1 = WordParseGetWord(wp); while (s1 != "") { // Remove Punctuation s1 = ConvertNoPunctuation(s1); // Binary Search t_start2 = GetTickCount(); wx = BinarySearchTable(words, WCOL, s1, SORT_ALPHA); t_search += GetTickCount() - t_start2; if (wx >= 0) { cc = DecimalToInteger(words[wx][CCOL]) + 1; if (cc > mc) { mc = cc; mcw = s1; } words[wx][CCOL] = FormatString("%d", cc); } else { words[last][WCOL] = s1; words[last][CCOL] = "1"; t_start2 = GetTickCount(); SortTable(words, SORT_ALPHA, WCOL); t_sort += GetTickCount() - t_start2; last++; } // Next Word s1 = WordParseGetWord(wp); } t_tot = GetTickCount() - t_start1; wx = BinarySearchTable(words, WCOL, mcw, SORT_ALPHA); AddMessage(" Unique Words: %d", last); AddMessage(" Most Common Word: %s (%d times)", words[wx][WCOL], words[wx][CCOL]); AddMessage(" Sort Time: %dms", t_sort); AddMessage(" Search Time: %dms", t_search); AddMessage(" Total Time: %dms", t_tot); }
This time we use the BinarySearchTable function to search the search the list. This function requires that the table is sorted by the column we are searching. Since the first time running our table is empty we don’t need to worry about this. However, as we add entries to the list we will need to sort the list using the SortTable function. For this code we now also keep track of how long the sort function runs. Since the list being sorted is required for the binary search to function, we need to include it in our time spent analysis. Running the script now yields a result similar to this:
Running with Linear Search... Unique Words: 11649 Most Common Word: the (7383 times) Sort Time: 0ms Search Time: 105687ms Total Time: 106297ms Running with Binary Search... Unique Words: 11649 Most Common Word: the (7383 times) Sort Time: 39681ms Search Time: 125ms Total Time: 40578ms
As you can see using the BinarySearchTable and SortTable functions shaves about a minute off the run time. This is pretty good but if you look at the results you can see that while the search was extremely fast our time is now mostly spent sorting the list. This is because every time we add an entry to the list we need to re-sort the list in order for the binary search to work. Ideally, we would insert the new word into the correct spot but since we don’t know where that spot is we need to use the sort function. However we can instead use a mix of both approaches. Let’s take a look at the mixed version:
void run_mix() { string words[][]; string unwords[][]; string s1, mcw; handle wp; int wx, last, mc, cc, slast, uslast; qword t_tot, t_search, t_sort, t_start1, t_start2; wp = WordParseCreate(WP_GENERAL, text); if (IsError(wp)) { AddMessage("Failed to load parser 0x%08x", GetLastError()); return; } t_start1 = GetTickCount(); last = 0; slast = 0; uslast = 0; t_search = 0; t_sort = 0; mc = 0; s1 = WordParseGetWord(wp); while (s1 != "") { // Remove Punctuation s1 = ConvertNoPunctuation(s1); // Binary Search and Linear Search t_start2 = GetTickCount(); wx = BinarySearchTable(words, WCOL, s1, SORT_ALPHA); if (wx < 0) { for (wx = 0; wx < uslast; wx++) { if (unwords[wx][WCOL] == s1) { break; } } t_search += GetTickCount() - t_start2; if (wx == uslast) { wx = -1; } else { cc = DecimalToInteger(unwords[wx][CCOL]) + 1; if (cc > mc) { mc = cc; mcw = s1; } unwords[wx][CCOL] = FormatString("%d", cc); } } else { t_search += GetTickCount() - t_start2; if (wx >= 0) { cc = DecimalToInteger(words[wx][CCOL]) + 1; if (cc > mc) { mc = cc; mcw = s1; } words[wx][CCOL] = FormatString("%d", cc); } } if (wx < 0) { unwords[uslast][WCOL] = s1; unwords[uslast][CCOL] = "1"; uslast++; // Add unsorted to Sorted if (uslast >= 100) { for (wx = 0; wx < uslast; wx++) { words[slast][WCOL] = unwords[wx][WCOL]; words[slast][CCOL] = unwords[wx][CCOL]; slast++; } uslast = 0; t_start2 = GetTickCount(); SortTable(words, SORT_ALPHA, WCOL); t_sort += GetTickCount() - t_start2; } last++; } // Next Word s1 = WordParseGetWord(wp); } t_tot = GetTickCount() - t_start1; wx = BinarySearchTable(words, WCOL, mcw, SORT_ALPHA); AddMessage(" Unique Words: %d", last); AddMessage(" Most Common Word: %s (%d times)", words[wx][WCOL], words[wx][CCOL]); AddMessage(" Sort Time: %dms", t_sort); AddMessage(" Search Time: %dms", t_search); AddMessage(" Total Time: %dms", t_tot); }
In this function we use the BinarySearchTable function but when an item is not found we search a smaller table of unsorted words. If the word is found in either table we updated its count. If it was not found in either we add it to the unsorted table. If the the unsorted table is too large we copy the entries to the sorted list and then run the SortTable function. I chose to limit the unsorted list to only 100 words but you can play with this amount in your own scripts to see what works best. Running the script now should yield something similar to this:
Running with Linear Search... Unique Words: 11649 Most Common Word: the (7383 times) Sort Time: 0ms Search Time: 105687ms Total Time: 106297ms Running with Binary Search... Unique Words: 11649 Most Common Word: the (7383 times) Sort Time: 39681ms Search Time: 125ms Total Time: 40578ms Running with Binary with partial linear... Unique Words: 11649 Most Common Word: the (7383 times) Sort Time: 639ms Search Time: 393ms Total Time: 1563ms
Now the histogram builds in 1.5 seconds! That is 68 times faster than the linear search and 25 times faster than the binary search alone. The only real downside to the last approach is the code is much more complex. However, even if you only use the search and sort routines on the entire list the performance gains can be a lifesaver. Keep in mind that Legato also has BinarySearchList and SortList which work the same way but with arrays. These functions offer a great deal of power and flexibility both in the best of times and the worst of times.
David Theis has been developing software for Windows operating systems for over fifteen years. He has a Bachelor of Sciences in Computer Science from the Rochester Institute of Technology and co-founded Novaworks in 2006. He is the Vice President of Development and is one of the primary developers of GoFiler, a financial reporting software package designed to create and file EDGAR XML, HTML, and XBRL documents to the U.S. Securities and Exchange Commission. |
Additional Resources
Legato Script Developers LinkedIn Group
Primer: An Introduction to Legato