Legato Developers Corner #6: Basic SGML Parsing

Friday, October 21. 2016

This week we will examine SGML parsing in Legato with two simple scripts. Many times you may want to simply scan or parse HTML or XML in order to locate particular data or log information. At that point, the data can be modified or other operations can be performed upon it. Legato contains word parsing and a full SGML Parse object. Full SGML parsing provides for DTD (Document Type Definition or Schema) checking, error checking and recovery, multiple line tag support, writing tools, as well as many other things. It’s powerful, but it also has a steeper learning curve. The SGML Parse object will be covered in a later series of articles. For now, let’s focus on the SGML mini-parse functions.

The following concepts will be explored in this installment:

SGML, HTML, and XML
SGML Parsing
Using a WordParse Object

Our Sample Script #1

	handle		hParse;
	string		s1, s2;
	int		item;

	hParse = WordParseCreate(WP_SGML_TAG);
	WordParseSetData(hParse, "<p align='center'>This is a <b>little</b> &amp; bold paragraph.</p>");

	s1 = WordParseGetWord(hParse);
	while (s1 != "") {
	  s2 = "";
	  if (WordParseHasSpace(hParse)) { s2 = "space"; }
	  AddMessage("%2d %-8s : %s", ++item, s2, s1);
	  s1 = WordParseGetWord(hParse);
	  }

Script Walkthrough

First, let’s talk a little about SGML, HTML, and XML. There are plenty of websites and books that dig deep into tagging concepts, but the basic idea is this: SGML is a collection of tags contained in open and closed chevrons (‘<’ and ‘>’) and data which is considered parsed character data, or PCDATA. Words and characters can just be placed into the stream of text, but since the chevron characters are used for tags, they must be expressed as character entities: &lt; and &gt;. That then leads to the ampersand being protected, so ‘&’ must be represented as &amp;. XML predefines only five named character entities: these three, plus &quot; and &apos; for quotation marks.

HTML, on the other hand, has many named character entities, like &quot; or &euro;. Character entities can also employ values, such as &#8224;, which is the Unicode character ‘†’. Character entity values can be used in both HTML and XML, so long as the browser understands the character code for the specified font.

You will notice a common thread with character entities. They start with an ampersand, have a name or #value, and end with a semicolon.
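That pattern is regular enough to match mechanically. As a rough illustration outside Legato, the Python sketch below uses the standard `html` module to encode and decode entities. The `ENTITY` regular expression and the `find_entities` helper are assumptions made for this example, not part of any Legato API.

```python
import html
import re

# Hypothetical pattern for this sketch: '&', then a name or '#' plus a
# decimal or hexadecimal value, then ';'.
ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")

def find_entities(text):
    """Return every character entity found in text, in source order."""
    return ENTITY.findall(text)

sample = "Curtis &amp; Lewis &lt;XML&gt; &#8224; &#x2020;"

entities = find_entities(sample)          # the raw entities, still encoded
decoded = html.unescape(sample)           # entities turned back into characters
escaped = html.escape("Curtis & Lewis <XML>", quote=False)  # protect &, <, >

print(entities)
print(decoded)
print(escaped)
```

Both `&#8224;` and `&#x2020;` decode to the same dagger character; the first gives the code point in decimal and the second in hexadecimal.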

An example of XML:

<name>Curtis &amp; Lewis</name>

HTML:

<p align="center">Chapter Six – Parsing &lt;XML&gt; Tags</p>

Note that the tags employ a forward slash ‘/’ to indicate a closing element. In the HTML example, there is an attribute named ‘align’ with the value ‘center’. An important difference between the mini-parse functions and the SGML Parse object is that the mini-parse functions do not care about the elements, attributes, or PCDATA. They just parse tags, character entities, and spaces.

The SGML mini-parse concept employs two major components: the Word Parse Object and a series of SGML tag functions. The word parser nibbles at the source data, pulling discrete chunks from it that are tags, character entities, and text. It also keeps track of leading white space, which can be important depending on what you are trying to accomplish.

It is easy to set up the Word Parse Object, but it is important to tell it that you want to work with SGML. There are three modes that can be used when the object is created: General (WP_GENERAL), SGML (WP_SGML_TAG) and Program (WP_PROGRAM and WP_PROGRAM_GROUP). Hand the mode to the SDK function WordParseCreate as a parameter:

hParse = WordParseCreate(WP_SGML_TAG);

The hParse variable is declared as a handle data type. Optionally, we can add a second parameter, which would be source string data, or we can add the data to the object using the WordParseSetData SDK function:

WordParseSetData(hParse, data);

We then use the WordParseGetWord SDK function to successively retrieve each item. Let’s parse through a little data and dump it to the default IDE log:

WordParseSetData(hParse, "<p align='center'>This is a <b>little</b> &amp; bold paragraph.</p>");

s1 = WordParseGetWord(hParse);
while (s1 != "") {
  s2 = "";
  if (WordParseHasSpace(hParse)) { s2 = "space"; }
  AddMessage("%2d %-8s : %s", ++item, s2, s1);
  s1 = WordParseGetWord(hParse);
  }

Our script sets up the parser, gets the first item, and continues to loop until there are no more items. The output would appear as follows:

 1          : <p align='center'>
 2          : This
 3 space    : is
 4 space    : a
 5 space    : <b>
 6          : little
 7          : </b>
 8 space    : &amp;
 9 space    : bold
10 space    : paragraph.
11          : </p>

Note that once a word has been parsed, all the characteristics of the item are available. In the above example, we are using the WordParseHasSpace SDK function to determine if there is a leading space. The WordParseGetResult SDK function returns information regarding why the parser stopped as it was getting the next item. You can also get the position of the data in the source string, so making replacements in the string is easy. Note that since the word parser keeps its own copy of the data, once a string has been added to the parser it can be discarded or changed by the script.
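To make the nibbling behavior concrete, here is a rough Python analogue of the loop above. It is not Legato code and makes no claim about how the Word Parse Object is implemented. The `TOKEN` pattern and the `mini_parse` helper are assumptions for this sketch, which reports each item together with its leading-space flag and its position in the source string.

```python
import re

# Hypothetical token pattern for this sketch: optional leading white space,
# then a tag, a character entity, a run of ordinary text, or a stray '<'/'&'.
TOKEN = re.compile(r"(\s*)(<[^>]*>|&#?\w+;|[^\s<&]+|[<&])")

def mini_parse(data):
    """Yield (has_space, start, token) for each item found in data."""
    for match in TOKEN.finditer(data):
        yield bool(match.group(1)), match.start(2), match.group(2)

data = "<p align='center'>This is a <b>little</b> &amp; bold paragraph.</p>"
items = list(mini_parse(data))

for number, (has_space, start, token) in enumerate(items, 1):
    print("%2d %-8s : %s" % (number, "space" if has_space else "", token))

# Because each item carries its position, replacing it in the source is a slice.
has_space, start, token = items[5]        # the word "little"
replaced = data[:start] + "tiny" + data[start + len(token):]
print(replaced)
```

Keeping the position alongside each item is what makes the in-place replacement a single slice operation, with no searching of the source string.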

Our Sample Script #2

	handle		hParse;
	string		s1, s2;
	string		aa[];
	int		ix, size;
	int		rc, item;

	s1 = FileToString("https://en.wikipedia.org/wiki/XML");
	if (s1 == "") {
	  rc = GetLastError();
	  AddMessage("Failed to load page (%08X)", rc);
	  exit;
	  }

	hParse = WordParseCreate(WP_SGML_TAG, s1);

	s1 = WordParseGetWord(hParse);
	while (s1 != "") {
	  s2 = GetTagElement(s1);
	  if (s2 != "") {
	    aa = GetTagAttributes(s1);
	    AddMessage("%4d %-16s  -: %s", ++item, s2, s1);
	    size = ArrayGetAxisDepth(aa);
	    for (ix = 0; ix < size; ix++) {
	      AddMessage("       %-16s : %s", ArrayGetKeyName(aa, ix), aa[ix]);
	      }
	    }
	  s1 = WordParseGetWord(hParse);
	  }

Script Walkthrough

So now that we can get discrete bits of HTML or XML, what can we do with them? That depends on our goals. Let’s look at tags. First, it would be nice to know if a parsed item is an SGML tag. The IsSGMLTag SDK function helps with that. It simply looks at the provided string and returns TRUE if the string matches an SGML tag or FALSE if it doesn’t.

Knowing if it’s a tag, we can get the element name using the GetTagElement SDK function. This function also has an option to include the namespace information with the element. The GetTagAttributes function will disassemble attributes into a named array, making it easy to check them.

Let’s play and take apart HTML tags in a webpage:

s1 = FileToString("https://en.wikipedia.org/wiki/XML");
if (s1 == "") {
  rc = GetLastError();
  AddMessage("Failed to load page (%08X)", rc);
  exit;
  }

hParse = WordParseCreate(WP_SGML_TAG, s1);

This time we create our Word Parse Object from the contents of a URL. We load the web page into a string using the SDK function FileToString. We can then pass that string to the WordParseCreate function as its second parameter.

With this, we can begin parsing:

s1 = WordParseGetWord(hParse);
while (s1 != "") {
  s2 = GetTagElement(s1);
  if (s2 != "") {
    aa = GetTagAttributes(s1);
    AddMessage("%4d %-16s  -: %s", ++item, s2, s1);
    size = ArrayGetAxisDepth(aa);
    for (ix = 0; ix < size; ix++) {
      AddMessage("       %-16s : %s", ArrayGetKeyName(aa, ix), aa[ix]);
      }
    }
  s1 = WordParseGetWord(hParse);
  }

Note a couple of new functions, in particular the SDK function GetTagAttributes. You can check if a string contains an SGML element by using the IsSGMLTag SDK function, but here we simply check whether the GetTagElement function returned an empty string before proceeding. The GetTagAttributes function returns a string attribute array, aa, which is loaded with the attribute values. The attribute names are the key names of the array, which makes it easy to iterate through the array as we do here and write the attribute information.
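For readers who want to compare against another environment, the same idea (an element name plus attributes keyed by name) can be sketched with Python's standard `html.parser` module. This is only an analogue of what GetTagElement and GetTagAttributes provide. The `TagLister` class and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Collect (element, attributes) for every tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs in source order.
        self.tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Mirror the article's output, where closing tags appear as "/a".
        self.tags.append(("/" + tag, {}))

lister = TagLister()
lister.feed('<div id="jump-to-nav" class="mw-jump"><a href="#mw-head">Jump</a></div>')
lister.close()

for number, (element, attributes) in enumerate(lister.tags, 1):
    print("%4d %s" % (number, element))
    for name, value in attributes.items():
        print("       %-16s : %s" % (name, value))
```

As with the named array in the script, keying the attributes by name means a single attribute such as `href` can be looked up directly instead of being searched for.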

Here are the first few lines of output followed by part of the body:

   1 !DOCTYPE           : <!DOCTYPE html>
       html               
   2 html               : <html class="client-nojs" lang="en" dir="ltr">
       class              client-nojs
       lang               en
       dir                ltr
   3 head               : <head>
   4 meta               : <meta charset="UTF-8"/>
       charset            UTF-8

                 ...

  50 div                : <div id="jump-to-nav" class="mw-jump">
       id                 jump-to-nav
       class              mw-jump
  51 a                  : <a href="#mw-head">
       href               #mw-head
  52 /a                 : </a>
  53 a                  : <a href="#p-search">
       href               #p-search
  54 /a                 : </a>
  55 /div               : </div>
  56 div                : <div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr">
       id                 mw-content-text
       lang               en
       dir                ltr
       class              mw-content-ltr
  57 div                : <div class="mw-stack" style="box-sizing: border-box; float:right;">
       class              mw-stack
       style              box-sizing: border-box; float:right;

With these functions, you can see how HTML or XML can be easily scanned with Legato. This can be used, for example, to search an RSS feed looking for a filing, or perhaps to find a result in an HTTP query to a website followed by a ‘web scrape’. There are a number of additional related SDK functions, such as the GetTagTextContent and IsSGMLEmptyElement functions, that can further enhance SGML parsing.

One final word. If you are moving through data line-by-line, for example using the ReadLine SDK function and handing the strings to the parser one-by-one, tags that are broken by line breaks will appear to be incomplete to the word parser. In that case, the parser will fail on split tags and not return a complete item. Hence, as in our above example, an entire page can be loaded and parsed as a single string. This way, returns do not break the flow of the string. The more sophisticated SGML Parse object has the capability to deal effectively with such conditions and also has error recovery when the code is not well formed.
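The effect of a split tag is easy to reproduce outside Legato. The following Python sketch is an illustration of the problem, not of the word parser itself. It looks for complete tags first line by line and then in the whole string; the `COMPLETE_TAG` pattern, the `complete_tags` helper, and the sample markup are assumptions for this example.

```python
import re

# Hypothetical pattern for this sketch: a complete tag opens with '<' and
# closes with '>' with no other chevron in between.
COMPLETE_TAG = re.compile(r"<[^<>]*>")

def complete_tags(chunk):
    """Return the complete tags found in one chunk of source text."""
    return COMPLETE_TAG.findall(chunk)

# The opening tag is broken across two lines, as often happens in real pages.
page = '<p\n   align="center">Chapter Six</p>'

# Line by line: neither half of the opening tag is a complete tag on its own.
by_line = [tag for line in page.splitlines() for tag in complete_tags(line)]

# Whole string: the line break is just white space inside the tag.
whole = complete_tags(page)

print(by_line)
print(whole)
```

Scanning line by line finds only the closing tag, while scanning the whole string recovers the opening tag as well.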

Scott Theis is the President of Novaworks and the principal developer of the Legato scripting language. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages.