Sometimes HTML code can be hard to read. While this isn’t an issue for editors and viewers that work with the rendered results, it can be a problem when advanced editing needs to happen. For example, a client sent over an HTML file that looks nice in the browser but it contains coding errors. When you go into the code to fix the errors, it’s hard to make the necessary changes because of problems like poor spacing.
Friday, November 24. 2017
LDC #60: Messy Code? No problem!
When using a product like GoFiler, you can always import the HTML file through the HTML to HTML filter to fix coding errors, but this will re-code the whole document. If you just want to correct the issues instead of changing the whole document, you need another solution. This is where Legato comes in. With its advanced SGML parsing capabilities, we can make a script to clean up HTML code and make it easier to read.
This week’s script “neatifies” HTML code. We can accomplish this using the SGML Object to parse the HTML and a String Pool to hold the edited HTML. Then we can write the entire String Pool over the file in a single edit action to allow the user to undo the function if he doesn’t like the results. Here is the complete script:
//
//
//      GoFiler Legato Script - Neatify HTML
//      ------------------------------------
//
//      Rev     10/13/2017
//
//
//      (c) 2017 Novaworks, LLC -- All rights reserved.
//
//      Takes HTML Code and "neatifies" it to indent properly
//

/********************************************************/
/* Global Items                                         */
/* ------------                                         */
/********************************************************/

int run   (int f_id, string mode, handle hEditWindow);
int setup ();

/****************************************/
int setup() {                                         /* Called from Application Startup */
/****************************************/
  string fnScript;                                    /* Us */
  string item[10];                                    /* Menu Item */
  int    rc;                                          /* Return Code */
                                                      /* */
  /* ** Add Menu Item */
  /* * Define Function */
  item["Code"] = "EXTENSION_NEATIFY_CODE";            /* Function Code */
  item["MenuText"] = "&Neatify Code";                 /* Menu Text */
  item["Description"] = "<B>Neatify Code</B> \r\rFormats HTML code.";  /* Description (long) */
  /* * Check for Existing */
  rc = MenuFindFunctionID(item["Code"]);              /* Look for existing */
  if (IsNotError(rc)) {                               /* Was already added */
    return ERROR_NONE;                                /* Exit */
  }                                                   /* end already added */
  /* * Registration */
  rc = MenuAddFunction(item);                         /* Add the item */
  if (IsError(rc)) {                                  /* Could not be added */
    return ERROR_NONE;                                /* Exit */
  }                                                   /* end error */
  fnScript = GetScriptFilename();                     /* Get the script filename */
  MenuSetHook(item["Code"], fnScript, "run");         /* Set the Test Hook */
  return ERROR_NONE;                                  /* Return value (does not matter) */
}                                                     /* end setup */

/****************************************/
int main() {                                          /* Initialize from Hook Processor */
/****************************************/
  handle window;                                      /* Window Handle */
  string windows[][];                                 /* List of All Windows */
  int    size;                                        /* Size of Edit window List */
  int    ix;                                          /* Counter */
                                                      /* */
  /* ** Initialize Hook Processor */
  /* * Check State */
  if (GetScriptParent() != "LegatoIDE") {             /* Not running in IDE? */
    return ERROR_NONE;                                /* Done */
  }                                                   /* end not in IDE */
  /* * Set up */
  setup();                                            /* Add to the menu */
  /* * Run Us (Debug Helper) */
  windows = EnumerateEditWindows();                   /* Get all edit windows */
  size = ArrayGetAxisDepth(windows);                  /* Get size of array */
  for (ix = 0; ix < size; ix++) {                     /* For each edit window */
    if (GetExtension(windows[ix]["Filename"]) == ".htm") {            /* Window is HTML file? */
      MessageBox("Running on window: %s", windows[ix]["Filename"]);   /* Display Message */
      window = MakeHandle(windows[ix]["ClientHandle"]);               /* Get Handle to Window */
      run(0, "preprocess", window);                   /* Run Our Function */
      return ERROR_NONE;                              /* Only run on one window */
    }                                                 /* end window is HTML */
  }                                                   /* end loop through windows */
  return ERROR_NONE;                                  /* return */
}                                                     /* end function */

/****************************************/
int run(int f_id, string mode, handle hEditWindow) {  /* Call from Hook Processor */
/****************************************/
  handle  hEditObject;                                /* Edit Object Handle */
  handle  hSGML;                                      /* SGML Object */
  handle  hPool;                                      /* String Pool */
  dword   type;                                       /* Type of Window */
  string  s1, s2;                                     /* General Strings */
  boolean add_returns;                                /* Want Returns */
  boolean was_close;                                  /* Last Tag was Close */
  boolean pre_text;                                   /* Preformatted */
  boolean first_item;                                 /* Is the first item */
  boolean added_space;                                /* Last thing was space */
  int     ex, ey;                                     /* End X and Y */
  int     token;                                      /* Token */
  int     indent;                                     /* Indent Level */
  int     rc;                                         /* Return Code */
                                                      /* */
  /* ** Run Neatify */
  /* * Safety */
  if (mode != "preprocess") {                         /* Not preprocess? */
    return ERROR_NONE;                                /* Return w/ no error */
  }                                                   /* end not preprocess */
  if (hEditWindow == NULL_HANDLE){                    /* No Handle? */
    hEditWindow = GetActiveEditWindow();              /* Get handle to edit window */
  }                                                   /* end no handle */
  /* * Check Window */
  /* o Handle */
  if (IsError(hEditWindow)) {                         /* Window handle is no good? */
    MessageBox('x', "Cannot get edit window.");       /* Display error */
    return ERROR_EXIT;                                /* Exit w/ Error */
  }                                                   /* end bad handle */
  /* o Type */
  type = GetEditWindowType(hEditWindow) & EDX_TYPE_ID_MASK;   /* Get the type of the window */
  if ((type != EDX_TYPE_PSG_PAGE_VIEW) &&             /* Type is Not Page View */
      (type != EDX_TYPE_PSG_TEXT_VIEW)){              /* and not Code View? */
    MessageBox('x', "This is not an HTML edit window.");      /* Display error */
    return ERROR_EXIT;                                /* Exit w/ Error */
  }                                                   /* end bad type */
  /* * Setup */
  hEditObject = GetEditObject(hEditWindow);           /* Create Edit Object from Window */
  hSGML = SGMLCreate(hEditObject);                    /* Create SGML Object */
  hPool = PoolCreate();                               /* Create Pool Object */
  indent = -1;                                        /* No Indent */
  was_close = FALSE;                                  /* No Close */
  pre_text = FALSE;                                   /* Not Preformatted */
  first_item = TRUE;                                  /* Next thing is first item */
  added_space = FALSE;                                /* Wasn't a space */
  /* * Parse Loop */
  s1 = SGMLNextItem(hSGML);                           /* Get First Item */
  while (s1 != "") {                                  /* While in SGML Code */
    switch (SGMLGetItemType(hSGML)) {                 /* Based on Type */
      case SPI_TYPE_SPACE:                            /* o Space */
        if (pre_text == FALSE) {                      /* Not preformatted? */
          if (was_close == FALSE) {                   /* Last thing was not a close tag */
            PoolAppend(hPool, " ");                   /* Add a Space */
            added_space = TRUE;                       /* We added a space */
          }                                           /* end last thing was not a close tag */
        }                                             /* end not preformatted */
        else {                                        /* Preformatted? */
          PoolAppend(hPool, s1);                      /* Add as-is */
          added_space = FALSE;                        /* Not space */
        }                                             /* end preformatted */
        break;                                        /* end case */
      case SPI_TYPE_TEXT:                             /* o Text */
      case SPI_TYPE_CHAR:                             /* o Character Entity */
        PoolAppend(hPool, s1);                        /* Pass as-is */
        added_space = FALSE;                          /* Not space */
        break;                                        /* end case */
      case SPI_TYPE_TAG:                              /* o Tag */
        add_returns = TRUE;                           /* Reset Flag */
        token = SGMLGetElementToken(hSGML);           /* Get Token */
        /* . Special Tag Processing */
        if (token == HT_PRE) {                        /* Pre Tag? */
          pre_text = TRUE;                            /* Set flag */
        }                                             /* end pre tag */
        if (token == HT__PRE) {                       /* Close Pre Tag? */
          pre_text = FALSE;                           /* Clear flag */
        }                                             /* end close pre tag */
        /* . Check for Tag Type */
        rc = HTMLIsBlockElement(hSGML);               /* Get If Block */
        if (rc == HTML_IS_FALSE) {                    /* Not Block? */
          rc = HTMLIsTableFrameElement(hSGML);        /* Check if Table */
        }                                             /* end not block */
        if (rc == HTML_IS_FALSE) {                    /* Not Table or Block? */
          rc = HTMLIsListContainer(hSGML);            /* Check if List */
        }                                             /* end not table or block */
        if (rc == HTML_IS_FALSE) {                    /* Not Table or Block or List? */
          rc = HTMLIsHeadElement(hSGML);              /* Check if Head */
        }                                             /* end not table or block or list */
        /* . Checks */
        if (rc == HTML_IS_DOC_CLOSE) {                /* Doc close? */
          rc = HTML_IS_CLOSE;                         /* Lie */
        }                                             /* end Doc close */
        /* . Adjust State by Type */
        if (rc != HTML_IS_FALSE) {                    /* A Block Tag? */
          if (rc == HTML_IS_OPEN) {                   /* Open Tag? */
            indent += 1;                              /* Increase Indent */
            was_close = FALSE;                        /* Last is not close */
          }                                           /* end open tag */
          if (rc == HTML_IS_CLOSE) {                  /* Close Tag? */
            indent -= 1;                              /* Decrease Indent */
            if (was_close == FALSE) {                 /* Last was not a close? */
              add_returns = FALSE;                    /* No returns */
            }                                         /* end last was not a close */
            was_close = TRUE;                         /* Last is now a close */
          }                                           /* end close tag */
          if (rc == HTML_IS_SELF) {                   /* Self Close Tag? */
            add_returns = FALSE;                      /* No returns */
            was_close = TRUE;                         /* Last is now a close */
          }                                           /* end self close tag */
        }                                             /* end block tag */
        else {                                        /* Inline Tag */
          was_close = FALSE;                          /* Not a close tag */
          add_returns = FALSE;                        /* No returns on inline */
        }                                             /* end inline tag */
        /* . Special Formatting */
        if (FindInString(s1, "<!--") == 0) {          /* Comment? */
          add_returns = TRUE;                         /* Put on separate line */
        }                                             /* end comment */
        if (token == HT_TITLE) {                      /* Special case */
          add_returns = TRUE;                         /* Put on separate line */
        }                                             /* end special case */
        if (token == HT__HEAD) {                      /* Special case */
          add_returns = TRUE;                         /* Put on separate line */
        }                                             /* end special case */
        /* . Add To Pool */
        if ((add_returns == TRUE) && (first_item == FALSE)) {         /* Want Returns? */
          if (added_space == TRUE) {                  /* Last thing was a space? */
            PoolSetPosition(hPool, PoolGetPosition(hPool) - 1);       /* Go back over it */
          }                                           /* end last was a space */
          s2 = "";                                    /* Clear String */
          if (indent > 0) {                           /* Have Indent? */
            s2 = PadString("", indent);               /* Make Padding String */
          }                                           /* end have indent */
          if (was_close == TRUE) {                    /* Last Tag was a close? */
            s2 += " ";                                /* Add a space */
          }                                           /* end last tag was a close */
          PoolAppend(hPool, "\r" + s2);               /* Add Return and Indent */
        }                                             /* end want returns */
        PoolAppend(hPool, s1);                        /* Add the tag */
        added_space = FALSE;                          /* Not space */
        break;                                        /* end case */
    }                                                 /* end switch on type */
    /* o Next item */
    first_item = FALSE;                               /* Added something */
    s1 = SGMLNextItem(hSGML);                         /* Get Next Element */
  }                                                   /* end parse loop */
  /* * Replace File */
  s1 = PoolGetString(hPool);                          /* Get File */
  ey = GetLineCount(hEditObject) - 1;                 /* Get Last Line */
  ex = GetLineSize(hEditObject, ey);                  /* Get Last X */
  WriteSegment(hEditObject, s1, 0, 0, ex, ey);        /* Replace File */
  CloseHandle(hEditObject);                           /* Close Edit Object */
  return ERROR_NONE;                                  /* Exit Done */
}                                                     /* end function */
The setup function that adds the script function to the ribbon has been covered many times in the past so we will skip over it. The main function checks to see if the script is being run from the IDE and if so it adds the hook. It also looks at the open windows for an HTML file that it can be run on. This is useful for debugging the script as debugging hooks is much harder.
Now on to the core of the script, the run function. First, it checks the run mode and the passed window handle. When the function is run from the menu hook, no handle is passed, so it falls back to the active edit window. It then checks the window’s type to make sure the function can be run.
/* * Safety */
if (mode != "preprocess") {                           /* Not preprocess? */
  return ERROR_NONE;                                  /* Return w/ no error */
}                                                     /* end not preprocess */
if (hEditWindow == NULL_HANDLE){                      /* No Handle? */
  hEditWindow = GetActiveEditWindow();                /* Get handle to edit window */
}                                                     /* end no handle */
/* * Check Window */
/* o Handle */
if (IsError(hEditWindow)) {                           /* Window handle is no good? */
  MessageBox('x', "Cannot get edit window.");         /* Display error */
  return ERROR_EXIT;                                  /* Exit w/ Error */
}                                                     /* end bad handle */
/* o Type */
type = GetEditWindowType(hEditWindow) & EDX_TYPE_ID_MASK;     /* Get the type of the window */
if ((type != EDX_TYPE_PSG_PAGE_VIEW) &&               /* Type is Not Page View */
    (type != EDX_TYPE_PSG_TEXT_VIEW)){                /* and not Code View? */
  MessageBox('x', "This is not an HTML edit window.");        /* Display error */
  return ERROR_EXIT;                                  /* Exit w/ Error */
}                                                     /* end bad type */
After all the safety checks, we can prepare to parse the document by getting the Edit Object and creating the SGML Object from it. We also set up the output by using PoolCreate to create an empty string pool. As a reminder, a string pool is an object we can use to efficiently append many smaller strings together. Since the parser breaks the file into small pieces, this is a good way to assemble our edited file as we go. We also initialize some variables used by our parsing loop. The indent variable is the current indent level and determines how many spaces are added. The next four variables, was_close, pre_text, first_item, and added_space, are flags that indicate the following: the last tag was a close tag, we are inside preformatted text, the next item is the first item in the file, and the last thing we added to the output was a space.
/* * Setup */
hEditObject = GetEditObject(hEditWindow);             /* Create Edit Object from Window */
hSGML = SGMLCreate(hEditObject);                      /* Create SGML Object */
hPool = PoolCreate();                                 /* Create Pool Object */
indent = -1;                                          /* No Indent */
was_close = FALSE;                                    /* No Close */
pre_text = FALSE;                                     /* Not Preformatted */
first_item = TRUE;                                    /* Next thing is first item */
added_space = FALSE;                                  /* Wasn't a space */
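To illustrate why a pool suits this job, here is a small Python sketch of the same idea. It is not Legato code: the StringPool class and its method names are hypothetical stand-ins, and the real String Pool may work quite differently internally. The sketch only shows the pattern of collecting many small pieces and joining them once at the end.

```python
class StringPool:
    """Toy stand-in for a string pool: collect pieces, join once at the end."""

    def __init__(self):
        self._pieces = []

    def append(self, text):
        # Appending to a list is cheap; no large string is rebuilt per piece.
        self._pieces.append(text)

    def get_string(self):
        # A single join produces the final document.
        return "".join(self._pieces)


pool = StringPool()
for piece in ["<p>", "Hello", " ", "world", "</p>"]:
    pool.append(piece)
print(pool.get_string())  # <p>Hello world</p>
```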
Now that all the preparation is done, we can begin to parse the file. We are using SGMLNextItem since we want not just the tags but also all the spaces, text, and character entities as well. So we will loop until SGMLNextItem returns an empty string. Our processing inside the loop is based on the type of item we encounter while parsing. If we encounter a tag there is different processing from text or spaces. In order to accomplish this we use the SGMLGetItemType function to get the type of the current parsed item. We then have a switch statement based on that type. To finish the loop we mark that we are no longer the first item in the file by setting first_item to false and then call SGMLNextItem to get the next thing.
/* * Parse Loop */
s1 = SGMLNextItem(hSGML);                             /* Get First Item */
while (s1 != "") {                                    /* While in SGML Code */
  switch (SGMLGetItemType(hSGML)) {                   /* Based on Type */
    ...
    ...
    ...
  }                                                   /* end switch on type */
  /* o Next item */
  first_item = FALSE;                                 /* Added something */
  s1 = SGMLNextItem(hSGML);                           /* Get Next Element */
}                                                     /* end parse loop */
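The shape of this loop can be modeled outside of Legato. The Python sketch below is only an approximation under stated assumptions: iter_items and its regular expression are hypothetical, the item names loosely mirror the SPI_TYPE_* constants, and a real SGML parser handles many cases (such as a ">" inside a comment or attribute value) that this pattern does not.

```python
import re

# One alternative per item type; the name of the group that matched is the type.
ITEM_RE = re.compile(
    r"(?P<tag><[^>]*>)"            # a tag or comment, naively up to the first '>'
    r"|(?P<char>&#?\w+;)"          # a character entity such as &amp; or &#160;
    r"|(?P<space>\s+)"             # a run of whitespace
    r"|(?P<text>[^<&\s]+|[<&])"    # anything else, including a stray '<' or '&'
)


def iter_items(html):
    """Yield (type, value) pairs in document order."""
    for match in ITEM_RE.finditer(html):
        yield match.lastgroup, match.group()


# Dispatch on the item type, as the switch statement does.
counts = {}
for kind, value in iter_items("<p>a &amp;  b</p>"):
    counts[kind] = counts.get(kind, 0) + 1
print(counts)
```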
Let’s start with space processing. In HTML, multiple whitespace characters are treated as a single space (unless they are inside a PRE tag). With that in mind, if we are not in a PRE tag (pre_text flag) and the last item was not a close tag (was_close flag), we add a single space regardless of what kind of whitespace was in the file. We then record that we added a space by setting the added_space flag. If we are in a PRE tag, we take the whitespace as-is from the source and clear the added_space flag, since we don’t want to edit any spacing inside the PRE tag. It is important to note that properly encoded non-breaking spaces (&nbsp; or &#160;) will not be processed here since they are character entities and not spaces.
case SPI_TYPE_SPACE:                                  /* o Space */
  if (pre_text == FALSE) {                            /* Not preformatted? */
    if (was_close == FALSE) {                         /* Last thing was not a close tag */
      PoolAppend(hPool, " ");                         /* Add a Space */
      added_space = TRUE;                             /* We added a space */
    }                                                 /* end last thing was not a close tag */
  }                                                   /* end not preformatted */
  else {                                              /* Preformatted? */
    PoolAppend(hPool, s1);                            /* Add as-is */
    added_space = FALSE;                              /* Not space */
  }                                                   /* end preformatted */
  break;                                              /* end case */
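The three outcomes of this case can be restated as a small Python function. This is a sketch for illustration rather than part of the script: handle_space is a hypothetical name, and a plain list stands in for the string pool.

```python
def handle_space(out, raw, pre_text, was_close, added_space):
    """Append whitespace to out (a list of pieces) and return the new added_space flag."""
    if pre_text:
        out.append(raw)   # inside PRE: keep the whitespace exactly as written
        return False
    if not was_close:
        out.append(" ")   # any run of whitespace collapses to a single space
        return True
    return added_space    # whitespace right after a close tag is dropped, flag unchanged


out = []
flag = handle_space(out, "\r\n\t  ", pre_text=False, was_close=False, added_space=False)
print(repr("".join(out)), flag)  # ' ' True
```

Note that in the third branch nothing is written at all, which is how the script removes the stray line breaks and tabs that follow close tags in messy code.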
Next is text and character entity processing. There isn’t much to discuss here since we want the text and entities to stay as-is. If we wanted to do any special processing on the text (such as replacing words or something like that) it could be done here but for this script we will copy the contents, set that the last thing wasn’t a space (clear added_space) and move on.
case SPI_TYPE_TEXT:                                   /* o Text */
case SPI_TYPE_CHAR:                                   /* o Character Entity */
  PoolAppend(hPool, s1);                              /* Pass as-is */
  added_space = FALSE;                                /* Not space */
  break;                                              /* end case */
Now we can dig into tag processing. Our script’s purpose is to neaten HTML code, so we need to add indenting and also clean up excessive whitespace. As stated above, any run of whitespace in HTML is treated as a single space; this includes line returns. Therefore, we cannot simply put every tag on its own line. For inline tags, doing so could insert unwanted whitespace. This is especially an issue when dealing with specific revisions for EDGAR. Consider what would happen if an inline tag was in the middle of a word. A space would then also be in the middle of the word. That would be no good.
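A quick Python sketch makes the problem concrete. The rendered_text helper is hypothetical and only a rough approximation of what a browser displays: it drops tags and collapses whitespace runs, which is enough to show the effect.

```python
import re


def rendered_text(html):
    """Rough approximation of visible text: drop tags, collapse whitespace runs."""
    text = re.sub(r"<[^>]*>", "", html)
    return re.sub(r"\s+", " ", text).strip()


inline = "un<b>believ</b>able"
one_tag_per_line = "un\n<b>\nbeliev\n</b>\nable"

print(rendered_text(inline))            # unbelievable
print(rendered_text(one_tag_per_line))  # un believ able
```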
So, the first thing we need to do is determine what kind of tag we are dealing with and whether it needs any special processing. To do this we will use a few SDK functions. The first is SGMLGetElementToken, which returns a tokenized value for the tag. This means that instead of checking if the current item begins with “<a ” or “<table ” we can check if the returned token is HT_A or HT_TABLE respectively. This is a faster operation than string comparison and is also less prone to problems resulting from coding inconsistencies in the HTML. After getting the token we check to see if it is a PRE tag, since those tags require special processing. We could also add script tags here if we wanted JavaScript code to be left alone, but for EDGAR HTML it is not needed.
Next we will use some new functions from the current version: the HTMLIs...Element functions. HTMLIsBlockElement returns whether the current tag is a block tag and, if so, whether it is an open tag or a close tag. Likewise, HTMLIsTableFrameElement does the same thing for table tags (like TD and TR). These functions combined with the tokens mean we don’t need to have big lists of elements that need processing. For some past scripts like Steven’s Editing CSS Properties series, we could easily add processing for all blocks using this function. To adjust indenting, we need to add more indenting when a block is opened and remove indenting when a block is closed. So we check if the current element is a block, table, list, or head element.
case SPI_TYPE_TAG:                                    /* o Tag */
  add_returns = TRUE;                                 /* Reset Flag */
  token = SGMLGetElementToken(hSGML);                 /* Get Token */
  /* . Special Tag Processing */
  if (token == HT_PRE) {                              /* Pre Tag? */
    pre_text = TRUE;                                  /* Set flag */
  }                                                   /* end pre tag */
  if (token == HT__PRE) {                             /* Close Pre Tag? */
    pre_text = FALSE;                                 /* Clear flag */
  }                                                   /* end close pre tag */
  /* . Check for Tag Type */
  rc = HTMLIsBlockElement(hSGML);                     /* Get If Block */
  if (rc == HTML_IS_FALSE) {                          /* Not Block? */
    rc = HTMLIsTableFrameElement(hSGML);              /* Check if Table */
  }                                                   /* end not block */
  if (rc == HTML_IS_FALSE) {                          /* Not Table or Block? */
    rc = HTMLIsListContainer(hSGML);                  /* Check if List */
  }                                                   /* end not table or block */
  if (rc == HTML_IS_FALSE) {                          /* Not Table or Block or List? */
    rc = HTMLIsHeadElement(hSGML);                    /* Check if Head */
  }                                                   /* end not table or block or list */
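For readers who want to experiment with the classification idea outside of GoFiler, the Python sketch below hand-rolls a comparable check. Everything in it is an assumption made for illustration: the element sets are short, incomplete samples, the classify function is hypothetical, and it does not model the separate value for closing HTML/BODY tags. It shows the kind of element list the SDK functions spare us from maintaining.

```python
import re

# Small, incomplete sample sets; the SDK functions know the full lists.
BLOCK = {"p", "div", "center", "blockquote", "h1", "h2", "h3"}
TABLE = {"table", "tr", "td", "th"}
LIST = {"ul", "ol", "dl"}
HEAD = {"html", "head", "body"}
INDENTING = BLOCK | TABLE | LIST | HEAD

# Groups: 1 = leading slash (close tag), 2 = element name, 3 = trailing slash (self closed).
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)\s*>")


def classify(tag):
    """Return 'open', 'close', 'self', or 'false' for a single tag string."""
    match = TAG_RE.fullmatch(tag)
    if match is None:
        return "false"          # comments, doctype, malformed tags
    closing, name, self_closed = match.group(1), match.group(2).lower(), match.group(3)
    if self_closed:
        return "self"
    if name not in INDENTING:
        return "false"          # inline or unknown element
    return "close" if closing else "open"


for tag in ["<TD align=\"left\">", "</tr>", "<b>", "<img src=\"cat.jpg\" />"]:
    print(tag, classify(tag))
```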
After all the checks, rc can have one of the following values:
HTML_IS_FALSE for any inline or unknown tag.
HTML_IS_OPEN for an open block, table, list, or HTML structure tag.
HTML_IS_CLOSE for a close block, table, list, or HTML structure tag.
HTML_IS_DOC_CLOSE for a close HTML/BODY tag.
HTML_IS_SELF for any self-closed HTML tag (e.g. <img src="cat.jpg" />).
Now that we have that information we can deal with it accordingly. First, if rc is HTML_IS_DOC_CLOSE we change it to HTML_IS_CLOSE, since we don’t need to differentiate between close HTML/BODY tags and any other close block. Next we adjust our state flags. If rc is anything other than HTML_IS_FALSE we need to do some special processing. If it is HTML_IS_OPEN we increase our indent and set was_close to false. If it is HTML_IS_CLOSE we decrease our indent, and if the last tag was not also a close tag we shouldn’t add returns (so we clear add_returns). Additionally, we set was_close to true since this is a close tag. Finally, if it is HTML_IS_SELF we clear add_returns since we don’t want returns, and we set was_close since a self-closed tag also counts as a close. If the value was HTML_IS_FALSE we set both was_close and add_returns to false since an inline tag gets no special processing.
/* . Checks */
if (rc == HTML_IS_DOC_CLOSE) {                        /* Doc close? */
  rc = HTML_IS_CLOSE;                                 /* Lie */
}                                                     /* end Doc close */
/* . Adjust State by Type */
if (rc != HTML_IS_FALSE) {                            /* A Block Tag? */
  if (rc == HTML_IS_OPEN) {                           /* Open Tag? */
    indent += 1;                                      /* Increase Indent */
    was_close = FALSE;                                /* Last is not close */
  }                                                   /* end open tag */
  if (rc == HTML_IS_CLOSE) {                          /* Close Tag? */
    indent -= 1;                                      /* Decrease Indent */
    if (was_close == FALSE) {                         /* Last was not a close? */
      add_returns = FALSE;                            /* No returns */
    }                                                 /* end last was not a close */
    was_close = TRUE;                                 /* Last is now a close */
  }                                                   /* end close tag */
  if (rc == HTML_IS_SELF) {                           /* Self Close Tag? */
    add_returns = FALSE;                              /* No returns */
    was_close = TRUE;                                 /* Last is now a close */
  }                                                   /* end self close tag */
}                                                     /* end block tag */
else {                                                /* Inline Tag */
  was_close = FALSE;                                  /* Not a close tag */
  add_returns = FALSE;                                /* No returns on inline */
}                                                     /* end inline tag */
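To see how these flags evolve, the same rules can be written as a pure function and traced by hand. The Python below is a sketch, not the script itself: update_state is a hypothetical name and the string values stand in for the HTML_IS_* constants.

```python
def update_state(kind, indent, was_close):
    """Return (indent, was_close, add_returns) after a tag of the given kind.

    kind is 'open', 'close', 'doc_close', 'self', or 'false'.
    """
    add_returns = True                 # reset for every tag, as in the script
    if kind == "doc_close":
        kind = "close"                 # treated like any other close
    if kind == "open":
        indent += 1
        was_close = False
    elif kind == "close":
        indent -= 1
        if not was_close:
            add_returns = False        # first close after content stays on the same line
        was_close = True
    elif kind == "self":
        add_returns = False
        was_close = True
    else:                              # inline or unknown tag
        was_close = False
        add_returns = False
    return indent, was_close, add_returns


# Trace the tags of <div><p>Hi</p></div>, starting from the script's initial state.
state = (-1, False)
for kind in ["open", "open", "close", "close"]:
    indent, was_close, add_returns = update_state(kind, *state)
    print(kind, indent, was_close, add_returns)
    state = (indent, was_close)
```

In the trace, the closing P tag does not get a return because the previous tag was not a close, while the closing DIV does because it follows another close tag.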
Now that we have updated our state, we are ready to add the information to the output pool. But before doing so we can do some processing for special cases. For us, this means comments and tags in the HTML head section. We want comments to be placed on their own line like blocks, but without changing the indenting (please note that breaking the line at an inline comment may introduce spacing, but this is a rare case as most HTML comments are used to mark segments of code). We also want the TITLE tag, which is otherwise treated as an inline element, to be indented on its own line, and likewise the closing HEAD tag.
/* . Special Formatting */
if (FindInString(s1, "<!--") == 0) {                  /* Comment? */
  add_returns = TRUE;                                 /* Put on separate line */
}                                                     /* end comment */
if (token == HT_TITLE) {                              /* Special case */
  add_returns = TRUE;                                 /* Put on separate line */
}                                                     /* end special case */
if (token == HT__HEAD) {                              /* Special case */
  add_returns = TRUE;                                 /* Put on separate line */
}                                                     /* end special case */
Now that specialty processing is done, we can finally add the information to the output. The first thing we want to do is check whether to add a return and an indent. This is easily done by checking the add_returns flag and the indent value. We don’t want to add a return before the open HTML tag, so we also check the first_item flag. If we want the return, we need to check if the last thing we wrote out was a space. If so, we need to back over it in the output, otherwise the previous line will have a trailing space. This isn’t a big deal, but the added_space flag and the PoolSetPosition function make it easy to fix. Together, the PoolGetPosition and PoolSetPosition functions allow us to move the output position in the pool back one character. After dealing with the space we can create our indenting. For this we use the PadString function, which inserts spaces in front of a string value. We can feed it an empty string to make a string of just spaces. Then, if the last tag was a close tag, we add a single space to account for the fact that the close tag decreased the indent. We then use PoolAppend to add a return plus the indent string to the output pool. Finally, we add the tag to the pool and set added_space to false since the last thing in the pool was not a space.
  /* . Add To Pool */
  if ((add_returns == TRUE) && (first_item == FALSE)) {       /* Want Returns? */
    if (added_space == TRUE) {                        /* Last thing was a space? */
      PoolSetPosition(hPool, PoolGetPosition(hPool) - 1);     /* Go back over it */
    }                                                 /* end last was a space */
    s2 = "";                                          /* Clear String */
    if (indent > 0) {                                 /* Have Indent? */
      s2 = PadString("", indent);                     /* Make Padding String */
    }                                                 /* end have indent */
    if (was_close == TRUE) {                          /* Last Tag was a close? */
      s2 += " ";                                      /* Add a space */
    }                                                 /* end last tag was a close */
    PoolAppend(hPool, "\r" + s2);                     /* Add Return and Indent */
  }                                                   /* end want returns */
  PoolAppend(hPool, s1);                              /* Add the tag */
  added_space = FALSE;                                /* Not space */
  break;                                              /* end case */
}                                                     /* end switch on type */
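The output step can also be sketched in Python to make the trailing-space fix easier to see. This is an illustration under assumptions: emit_tag is a hypothetical name, a list stands in for the pool (so removing the last piece plays the role of moving the pool position back one character), and "\n" is used where the script writes "\r".

```python
def emit_tag(out, tag, add_returns, first_item, added_space, indent, was_close):
    """Append tag to out, preceded by a line break and indent when wanted.

    Returns the new added_space flag, which is always False after a tag.
    """
    if add_returns and not first_item:
        if added_space and out and out[-1] == " ":
            out.pop()                          # back over the space we just wrote
        pad = " " * indent if indent > 0 else ""
        if was_close:
            pad += " "                         # compensate for the decreased indent
        out.append("\n" + pad)
    out.append(tag)
    return False


out = ["<div>", "text", " "]
emit_tag(out, "<p>", add_returns=True, first_item=False,
         added_space=True, indent=1, was_close=False)
print(repr("".join(out)))  # '<div>text\n <p>'
```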
Now that the main parsing loop is complete, we can write the output back to the source file. We use the PoolGetString function to get the entire pool as a string (this does mean the maximum file size that this function can be run on is limited to the maximum Legato string size). We then use GetLineCount and GetLineSize to get the last position in the file, and finally WriteSegment to replace the whole file with our string. Because we used the WriteSegment function, undo information is automatically added to the window. This means the user can use undo within GoFiler to restore the original code.
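The end coordinates are simply the index of the last line and the length of that line. As a rough Python illustration (end_position is a hypothetical helper, and it assumes lines separated by "\n", which may not match how the Edit Object stores text):

```python
def end_position(text):
    """Return (ex, ey): the length of the last line and the index of the last line."""
    lines = text.split("\n")
    ey = len(lines) - 1      # zero-based index of the last line
    ex = len(lines[ey])      # one past the last character on that line
    return ex, ey


print(end_position("<html>\n<body>\n</html>"))  # (7, 2)
```

Replacing everything from position (0, 0) through (ex, ey) in one call is what lets the whole change be undone as a single edit.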
This blog shows how Legato’s SGML parser can be used to do processing on HTML files even when only the spacing of the tags is important. This file can also be used as a template for any script that will process an entire HTML file.
David Theis has been developing software for Windows operating systems for over fifteen years. He has a Bachelor of Sciences in Computer Science from the Rochester Institute of Technology and co-founded Novaworks in 2006. He is the Vice President of Development and is one of the primary developers of GoFiler, a financial reporting software package designed to create and file EDGAR XML, HTML, and XBRL documents to the U.S. Securities and Exchange Commission.
Additional Resources