Legato Developers Corner #12: Understanding Strings

Friday, December 09. 2016

One of the design goals of Legato was to make it easy for programmers to process and manage textual information. To that end, there are hundreds of string functions in Legato that simplify parsing and manipulating text.

The string data type defines an expandable array, or ‘string’, of characters. Within Legato, as in many languages, strings are conventionally terminated with a zero-byte (\0). Character data runs up to the last character, which is followed by the zero terminator. When strings are added, copied, or otherwise managed, Legato expects that zero-byte to end the string. This matters because strings are not well suited to storing binary data: a string behaves normally until a zero-byte appears, and if your binary data contains one, the string will end before the data does. Dealing with binary information and strings will be covered in a later article.

Data can be loaded into strings by reading information from a file, using functions that return strings, or setting a string variable to a literal value:

string s;
s = "Hello World! Today is " + GetLocalTime(DS_MONTH_DAY_YEAR | DS_INITIAL);
MessageBox(s + ".");

If this is copied into the IDE and run, a message similar to the following will appear:

[Image: a message box containing the string set in the first example script.]

Note that we used a little math to combine strings, including adding a period at the end of the string within the message box call. Also notice the double quote is used to contain the ‘literal’ string data. If you want to place a quote into the string, it must be ‘escaped’ with a backslash, which itself can be escaped:

s = "This is a \"Special Message\" for you.";

s = "C:\\Program Files\\";

Here the double backslash is needed to escape the escape character. Certain letters have special meaning when combined with the backslash, such as \r, \n and \t for return (13), new line (10) and tab (9), respectively. These can also be written numerically as \13, \10 and \9: simply follow the backslash with a numeric literal, which is assumed to be octal if it has a leading 0. Hex codes are written by inserting an ‘x’: \xD, \xA and \x9, respectively. These conventions are common among many languages.
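
As a brief sketch of the letter escapes in use (the text itself is purely illustrative), the following builds a string that contains a line break and a tab:

string s;
// \r\n ends the first line; \t indents the second
s = "First line\r\nSecond line:\tindented value";
MessageBox(s);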

Single quotes allow a character to be treated like a number or single character value. Escape characters also apply.
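
As a rough sketch of a single-quoted value at work, the loop below compares each character of a string against '\t' and counts the matches. The variable names are illustrative, and indexing a string by position is covered under Strings as Pseudo Arrays below:

string s;
int x, tabs;
s = "Name\tValue\tNotes";
x = 0;
tabs = 0;
// Walk the string until the terminating zero-byte
while (s[x] != 0) {
  if (s[x] == '\t') {
    tabs++;
    }
  x++;
  }
AddMessage("Found %d tab characters", tabs);

With the string shown, the loop should count two tabs.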

Returning to our string, if you insert that zero-byte:

string s;
s = "I added a zero byte \0 right here.";
MessageBox(s);

The message box will display:

[Image: a message box containing the string set in the second example script.]

Notice that the segment “right here.” is missing. That is because, as the string was copied to the variable, the zero-byte stopped the copy process; Legato sees it as the end of the string. Likewise, if a zero-byte is inserted into the variable, the MessageBox function would not receive any of the data past it.

If we need to know the size of a string, the GetStringLength SDK function returns the size of the string, less the terminating zero-byte.
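
A minimal sketch, assuming GetStringLength takes the string as its only parameter and returns an integer:

string s;
s = "My Data";
// Seven characters; the terminating zero-byte is not counted
AddMessage("Length is %d", GetStringLength(s));

Going by the description above, this should log a length of 7.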

Strings as Pseudo Arrays

As mentioned before, strings are actually expandable character arrays. As such, you can reference individual characters in the array if you know the character’s position. For example:

string s;
int x;
s = "My Data";
while (s[x] != 0) {
  AddMessage("Pos %d value is %3d and character '%c'", x, s[x], s[x]);
  x++;
  }

will display in the log:

Pos 0 value is  77 and character 'M'
Pos 1 value is 121 and character 'y'
Pos 2 value is  32 and character ' '
Pos 3 value is  68 and character 'D'
Pos 4 value is  97 and character 'a'
Pos 5 value is 116 and character 't'
Pos 6 value is  97 and character 'a'

Note that the first character is at index 0. The program runs until s[x] == 0, the end of the string. So this:

char s[];
int x;
s = "My Data";
while (s[x] != 0) {
  AddMessage("Pos %d value is %3d and character '%c'", x, s[x], s[x]);
  x++;
  }

produces essentially the same result. Declaring ‘char s[]’ without a size means that the array ‘s’ is automatically allocated and will expand as required. However, the script engine will never give back space if the variable is made smaller. This approach also requires considerably more internal overhead. The array could also be a fixed size. For routines expecting string data, a character array works the same way. So why use a character array? For certain low level actions, such as reading a block of data from a file, it can be much more effective, and it can also be used to store binary data.

If we add the line:

AddMessage("Allocated %d, used %d", ArrayGetAxisSize(s), ArrayGetAxisDepth(s));

to the program and run it again, the additional line:

Allocated 200, used 8

will appear in the log. Note that 200 bytes were preallocated and 8 bytes are being used. (Remember the \0? It added a byte.) The ArrayGetAxisSize and ArrayGetAxisDepth SDK functions will not operate on a simply declared string because it is not actually a data array.

Finally, strings can also be declared as arrays of strings, including multidimensional arrays, so:

string s[2];
int i, x;
s[0] = "Entry A";
s[1] = "and B";
while (i < 2) {
  AddMessage("String at %d - '%s':", i, s[i]);
  x = 0;
  while (s[i][x] != 0) {
    AddMessage("  %d - %3d '%c'", x, s[i][x], s[i][x]);
    x++;
    }
  i++;
  }

will display the following in the log:

String at 0 - 'Entry A':
  0 -  69 'E'
  1 - 110 'n'
  2 - 116 't'
  3 - 114 'r'
  4 - 121 'y'
  5 -  32 ' '
  6 -  65 'A'
String at 1 - 'and B':
  0 -  97 'a'
  1 - 110 'n'
  2 - 100 'd'
  3 -  32 ' '
  4 -  66 'B'

Data can be exploded into a string array using the ExplodeString SDK function, and arrays are frequently used by functions to return complex data.

String Math

Can you add strings together? As we have already seen, yes. However, you cannot logically divide or subtract strings. The ‘+’ operator can be used inline or cumulatively, for example (as above, ‘s’ is defined as a string type):

s = "Today " + "is " + "Monday";

s  = "Today ";
s += "is ";
s += "Monday";

Both forms combine the string contents to create “Today is Monday”. The ‘.=’ operator is also allowed for compatibility with conventions of other languages.

Strings can also be compared via the ‘<’, ‘>’, ‘==’, ‘<=’ and ‘>=’ operators. A comparison is performed character by character, strictly on the binary code of each character, until a difference is found or a zero-byte is reached. That means “and” and “AND” are different. But what if you want to treat them the same? You must either normalize the case of the strings with the MakeUpperCase or MakeLowerCase SDK functions or use the CompareStringsNoCase function. The CompareStringsNoCase function returns -1, 0, or 1, depending on whether the result of the comparison is less than, equal to, or greater than. A version of the function that takes case into account is also available: the CompareStrings function.
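
The difference can be sketched as follows. The variable names are illustrative, and the example assumes that CompareStringsNoCase simply takes the two strings to compare as its parameters:

string a, b;
a = "and";
b = "AND";
// The == operator compares binary character codes, so case matters
if (a == b) {
  AddMessage("Operator    : equal");
  }
else {
  AddMessage("Operator    : different");
  }
// A result of 0 means the strings match when case is ignored
if (CompareStringsNoCase(a, b) == 0) {
  AddMessage("Ignore case : equal");
  }

Based on the behavior described above, the operator should report the strings as different while the case-insensitive comparison reports them as equal.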

More Comparisons

Legato provides the conventional functions IsInString and InString. The former simply returns true or false depending on whether or not the target string contains the matching character or string. The latter returns the zero-based position of the first match or -1 if a match could not be made.
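
A short sketch of both functions, assuming IsInString takes its parameters in the same order as InString (the string to search first, then the item to match):

string s;
s = "SUSPENDED FORM TYPE 8-K (0000991680-16-001234)";
// '(' is the 25th character, so its zero-based position is 24
AddMessage("Position of '(' : %d", InString(s, '('));
// '#' does not appear in the string, so -1 is expected
AddMessage("Position of '#' : %d", InString(s, '#'));
if (IsInString(s, "8-K")) {
  AddMessage("The string mentions form type 8-K");
  }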

To expand on the ‘in string’ concept, the FindInString SDK function adds a level of sophistication by allowing the specification of a starting search position and optional case sensitivity. The ReplaceInString function allows the matching segment to be replaced with a new segment. Further, the IsRegEx function performs a conventional and powerful regular expression match.

When working with lists, the ScanString function allows for the linear searching of a string as if it were a delimited list of items. The delimiter can be spaces, commas or line endings.

Finally, if you have a single-dimension list of strings or a two-dimensional table, the FindInList and FindInTable functions allow lists and tables to be linearly searched for matching items.

I Just Want Parts

We saw earlier that we can examine or even change individual characters when we use a string like a character array. What about multi-character segments? The GetStringSegment and DeleteStringSegment functions allow a program to easily extract and delete parts of a string. For example:

string s;
int x;

s = "SUSPENDED FORM TYPE 8-K (0000991680-16-001234)";

x = InString(s, '(');

AddMessage("CIK       : %s", GetStringSegment(s, x+1, 10));
AddMessage("Year      : %s", GetStringSegment(s, x+12, 2));
AddMessage("Serial    : %s", GetStringSegment(s, x+15, 6));

s = DeleteStringSegment(s, 0, x + 1);
s = GetStringSegment(s, 0, 20);

AddMessage("Accession : %s", s);

will display in the log:

CIK       : 0000991680
Year      : 16
Serial    : 001234
Accession : 0000991680-16-001234

The InString function looks for an opening ‘(’ character and stores the position in x. (Note that there is no error checking in this example. If the ‘(’ is not in the string, the program results will not be as desired.) x is then used as a basis for the GetStringSegment function to extract parts of the string. Finally, the DeleteStringSegment function is used to get just the accession number.
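
The missing error check could be sketched along these lines, relying on InString returning -1 when no match is made (the message wording is illustrative):

string s;
int x;

s = "SUSPENDED FORM TYPE 8-K (0000991680-16-001234)";

x = InString(s, '(');
if (x < 0) {
  // No opening parenthesis, so there is nothing to extract
  AddMessage("No accession number found in: %s", s);
  }
else {
  AddMessage("CIK       : %s", GetStringSegment(s, x+1, 10));
  }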

Conclusion

This introduction just touches on the many string functions available to programmers in Legato. In later articles, we will explore saving and loading strings to and from files, exploding strings, parameters and command lines, character scanning, conversion, and many other functions.

Scott Theis is the President of Novaworks and the principal developer of the Legato scripting language. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages.