EDGAR 18.3 was released on October 1st, 2018, and while most of its features are clearly documented, one change that was made seems to have flown under the radar of most people. Starting with this version, all facts created with elements of data type StringItemType are no longer rendered as HTML. Instead they are rendered as plain old strings. While this makes sense (and does close some potential security risks on the SEC’s side), it does effect how people are doing XBRL. For example, most people doing Risk Return taxonomy filings use <B> tags to make their heading items bold. This is no longer possible, because it would come out rendered like:
Friday, November 30. 2018
LDC #112: XBRL Rendering of Strings, Untangled!
<B>Investment Objective</B>
Instead of the proper:
Investment Objective
I’ve asked Structured Data about the issue and they have confirmed that it is indeed an intentional change, and it’s not going away. They also let me know that it doesn’t effect compliance, but many filers may want to change how they’re doing things anyway to make their documents look nicer.
GoFiler 4.24b, which was released on November 19th, does help by converting from HTML to Unicode strings, but what happens if you used an earlier version of GoFiler and simply upgrade to the latest version? All of your strings are going to be full of HTML code! This is a problem I’ve encountered a few times, and like any good programmer, I would rather spend a few hours writing a bit of code to automate a task than repeatedly spend thirty minutes fixing an issue over and over again.
The script being covered this week is the first time I’ve actively tried to change XBRL data with Legato and as such was a fun learning experience. This script works by running the validate function to ensure we have a validation log before parsing that validation log to look for error messages. Once we have an error message that indicates we have HTML in a string, we know where the problem is, so we can copy the text from that fact, modify it, and paste it back in. Easy! Let’s take a look at how it works.
We only have a single defined value in this script, MSG_MASK. It’s simply a regular expression against which we can check our error messages. If the error message matches this, then it means the error message is talking about a string containing HTML, so we know it’s a good target for our script to deal with.
#define MSG_MASK "^\\s*Error: Cell [A-Z|0-9]* contains an invalid fact. String data should not contain HTML tagging. HTML does not render properly in all viewers.$"
Structurally, our script this week is super simple. It’s just a main function, a run function, and a setup function. The main and setup functions are identical to basically every main and setup function in every other script, so there’s no need to discuss them. The run function is where the real work happens, so we’ll spend our time going over how it operates.
/****************************************/ int run(int f_id, string mode){ /* run save function */ /****************************************/ ... omitted variable declarations ... wstring woutput; /* output string */ wstring translated; /* translated character */
Normally I just skip over the variable declarations when explaining how these scripts work, but I think woutput and translated need a bit more time dedicated to them to explain what’s going on. These are both wstring variables instead of just normal string variables. That means they are 16 bit (wide) strings instead of 8 bit strings. We need to do this because in order to properly convert HTML to a string value, we’ll need to use a 16 bit wstring just in case the HTML we’re translating has a character entity in it that is beyond the 8 bit range. I used the “w” prefix in woutput to differentiate it from the 8-bit output both for consistency and because “wow-tput” is fun to say.
/* */ if (mode!="preprocess"){ /* if not preprocess */ return ERROR_NONE; /* exit */ } /* */ hWindow = GetActiveEditWindow(); /* get the active edit window */ hData = DataViewGetObject(hWindow); /* get the data view object */ msg_props = DataViewGetTab(hData); /* get tab data */ if (msg_props["Name"]!="XBRL Fields"){ /* if current field is not xbrl fields */ MessageBox('x',"Function must be run on XBRL Fields table"); /* display error */ return ERROR_EXIT; /* return */ } /* */ hSgml = SGMLCreate(); /* create SGML */ SGMLSetHTMLDTD(hSgml,DTD_DEFAULT_HTML_50); /* set the DTD */ RunMenuFunction("XBRL_VALIDATE"); /* run the XBRL Validation function */
We can start our function as we normally do: by checking to make sure we’re running in preprocess mode. Once we’re sure of that, we can get the active edit window and the handle to the Data Object inside it, hData. We also need to get the active tab and check that the name of the active tab is “XBRL Fields” because this function is designed to work only when the user has the XBRL Fields tab open and active. If that tab is not opened and active, the function tells the user and then returns an error. If we are on the XBRL Fields table, we can create an SGML Parser and give it a DTD (we’ll need this later for character translations). After that, we use the RunMenFunction function to validate our XBRL file. We need to force a validation action so that we can read the log from it.
hInfo = GetInfoViewWindow(); /* get the information view */ hLog = InfoViewGetLogObject(hInfo); /* get the log in information view */ num_log = LogGetMessageCount(hLog,LOG_NONE); /* get items in log */ log = LogCreate("XBRL Remove Strings"); /* create log handle */ AddMessage(log,"Removing HTML from StringItemType elements"); /* add log message */ for(ix=0;ix<num_log;ix++){ /* for each log entry */ if(LogGetMessageType(hLog,ix)==LOG_ERROR){ /* if it's an error message */ msg = LogGetMessage(hLog,ix); /* get the message */ msg_props = LogGetMessageProperties(hLog,ix); /* get message properties */ if(IsRegexMatch(msg,MSG_MASK)){ /* is it the error we're looking for */
Once validation is complete, we can get the handle to the Information View and then grab the handle to the Log Object inside of it. We can get the number of messages in the log, start our own output log, and add a message to the output log to indicate we’re starting our function’s check. Then we can loop through each log entry so we can determine the type of the entry. If it’s of type LOG_ERROR, we know that the log item we’re examining is an error message, so we can get the text of the message, get the properties of the message, and test to see if it matches our defined error type, MSG_MASK. For more information on logs and their properties, check out Josh’s blog here. If the message matches, then we know this is an error message for an XBRL field that needs processing.
cpos = FindInString(msg,"Cell "); /* get the position of the word cell */ spos = FindInString(msg," ",cpos+5); /* get the position of the next space */ coords = GetStringSegment(msg,cpos+5,spos-(cpos+5)); /* get the coordinate string */ coordflags = TextToInteger(msg_props["EX"]); /* get the flags */ xycoords[1] = (0xFFFF & coordflags); /* get the row */ xycoords[0] = (0xFFFF & (coordflags >> 16)); /* get the column */ text = DataViewCellGetText(hData,xycoords[0],xycoords[1]); /* get the text of the cell */ AddMessage(log,"Found HTML at %s (Row %d,Col %d)", /* add log message */ coords,xycoords[0]+1,xycoords[1]+1); /* add log message */ AddMessage(log," Old: %s",GetStringSegment(text,0,200)); /* add log message */
Now that we have our error message, we can use the FindInString and GetStringSegment functions to obtain the coordinates of the XBRL data field as a string representation (like “AB123”) to display to our user. This is not always 100% reliable, even though that the file has all columns between AAA and ZZZ, so we can’t just convert a string coordinate into a Row/Column coordinate directly. We need to get the coordinate flags out of the error message itself. Logs contain position information, but because this is an XBRL view, both of the coordinates are stored in the “EX” property. We can use some bitwise operations to break this value into row and column values, which we’re storing in the xycoords array. Once we have our coordinates, we can use the DataViewCellGetText function to grab the contents of the cell, log that we found a cell to work on, and what its current contents are.
while(nextword!=""){ /* while we have a next word */ translated=""; /* reset translated to nothing */ if(IsSGMLTag(nextword) == false){ /* if the next word is not an SGML tag */ if(IsSGMLCharacterEntity(nextword) == true){ /* if it's a character entity */ translated[0] = SGMLCharEntityToValue(hSgml, /* translate character entity */ nextword); /* translate character entity */ rc = GetLastError(); /* get the last error */ if(IsError(rc)){ /* if last code was error */ MessageBox('x', /* display error */ "Cannot translate character, error %0x",rc); /* */ return ERROR_EXIT; /* return error */ } /* */ } /* */ else{ /* if not character */ translated = AnsiToUnicode(nextword); /* add next word to output */ } /* */ if(WordParseHasSpace(wParse)){ /* if there was space before nextword */ woutput+= AnsiToUnicode(" "); /* add a space to the output */ } /* */ woutput+= translated; /* add translated to output */ } /* */ nextword = WordParseGetWord(wParse); /* get the next word */ } /* */
Just getting the text isn’t enough. Now we need to translate it from HTML to a wstring so it will display correctly in the web browser. We’re going to create a Word Parse Object to do this using the WordParseCreate function. We’ll create our object in the WP_SGML_TAG mode so it will stop on any SGML tags or character entities. We can then get the next word. Using a while loop, we’ll iterate over each word in the text we need to translate. For each word, we’ll first reset translated back to blank in case it has something in it already from the previous iteration of the loop and then check if the word at which we’re looking is an SGML tag or not. If it’s an SGML tag, we can just ignore it and use the WordParseGetWord function to get the next word in the text. If it is an SGML tag, we need to check if it’s a character entity or not. If it’s a character entity, we need to use the SGMLCharEntityToValue function to convert our entity to a numeric value and set the first character of our translated string equal to that value. If we can’t translate it, the function displays an error and returns. If it’s just a regular word and not an entity, we instead use the AnsiToUnicode function to convert it from a string to a wstring so we can add it to our output. If we had some trailing spaces before the word we just parsed (the WordParseHasSpace function tests for this), we can just add a blank space to our output. Finally, we attach our translated string to our wide string woutput, and go back for the next word to parse.
output = UnicodeToAnsi(woutput); /* get ANSI output for logging */ AddMessage(log," New: %s",GetStringSegment(output,0,200)); /* add log message */ AddMessage(log,""); /* add spacer to log */ hClipboard = ClipboardCreate(); /* get clipboard */ rc = ClipboardSetUnicode(hClipboard,woutput); /* set clipboard text */ if (IsError(rc)){ /* if it's an error */ MessageBox('x',"Cannot Set Clipboard text, error %0x",rc); /* display error */ ProgressClose(); /* close progress bar */ return ERROR_EXIT; /* return error */ } /* */ CloseHandle(hClipboard); /* close the clipboard handle */
Next we can convert down our woutput to output (16 bit back to 8 bit) so we can log the string into our output. We use the ClipboardCreate function to gain access to the Windows Clipboard. We can set the content of the clipboard with the ClipboardSetUnicode function, and immediately test to see if this operation was successfully. If it was successful, we must IMMEDIATELY use the CloseHandle function on the clipboard handle. While Legato has a handle open to the clipboard, nothing else can access it, including other execution threads in GoFiler, so if it’s not released before we do our paste operation below it will result in errors. It is a good practice to release the clipboard as soon as possible.
rc = DataViewSetCaret(hData,xycoords[0],xycoords[1]); /* set active cell */ if (IsError(rc)){ /* if it's an error */ MessageBox('x',"Cannot Set Caret Position, error %0x",rc); /* display error */ ProgressClose(); /* close progress bar */ return ERROR_EXIT; /* return error */ } /* */ RunMenuFunction("EDIT_PASTE"); /* paste content to active cell */ fixed++; /* add to number of fixed facts */ } /* */ } /* */ } /* */ AddMessage(log,"Fixed %d facts.",fixed); /* add message */ LogDisplay(log); /* display the log */ return ERROR_EXIT; /* return no error */ } /* */
Finally, we’re in the home stretch. We can use the DataViewSetCaret function to set our cursor position on the cell we want to edit, test to make sure we’re actually set to the correct position, and then use the RunMenuFunction function to execute a Paste operation which will overwrite the contents of the selected cell with our newly converted string. Lastly, we can add a message to the log about the number of fixed facts, display the log, and exit our script.
Here’s the complete script without commentary:
// // GoFiler - Remove HTML From Strings // ------------------------------------ // // Opens HTML, and for each string element it removes HTML from the entry // // Revised 11-26-2018 SCH created // // (c) 2018 Novaworks, LLC. All rights reserved. // #define MSG_MASK "^\\s*Error: Cell [A-Z|0-9]* contains an invalid fact. String data should not contain HTML tagging. HTML does not render properly in all viewers.$" int setup (); int run (int f_id, string mode); /****************************************/ int setup(){ /* primary setup function */ /****************************************/ string item[]; /* params of menu item */ string script; /* script running */ int rc; /* return code */ /* */ item["Code"]="XBRL_REMOVE_HTML"; /* function name */ item["MenuText"]="Remove HTML From XBRL"; /* menu item */ item["Description"]="Remove HTML from XBRL StringItemType elements";/* discription */ item["Class"] = "XBRLExtension"; /* Add to XBRL toolbar */ MenuAddFunction(item); /* add item to the menu */ script = GetScriptFilename(); /* get name of this script */ MenuSetHook(item["Code"],script,"run"); /* set hook to run function */ return ERROR_NONE; /* return */ } /* */ /****************************************/ int run(int f_id, string mode){ /* run save function */ /****************************************/ string msg; /* message from log file */ string msg_props[]; /* properties of the log message */ string column; /* column we're looking for */ string text; /* text of a string */ wstring woutput; /* output string */ string output; /* string output */ string coords; /* unparsed coordinate substring */ string nextword; /* the next word */ wstring translated; /* translated character */ int trans; /* translated character value */ int fixed; /* number of fixed facts */ int xycoords[2]; /* x and y coordinates */ int cpos; /* position of word cell in string */ int spos; /* position of a space in a string */ int rc; /* return code */ int fields_ix; /* index of XBRL fields table */ int num_sheets; /* number of sheets in the data view */ int num_log; /* number of errors in log */ int ix; /* index counter */ dword coordflags; /* get coordinate flags */ handle hSgml; /* sgml handle */ handle log; /* log handle */ handle hData; /* handle to data view object */ handle wParse; /* word parser handle */ handle hInfo; /* information view handle */ handle hLog; /* get the information log */ handle hClipboard; /* get the clipboard */ handle hWindow; /* edit window handle */ /* */ if (mode!="preprocess"){ /* if not preprocess */ return ERROR_NONE; /* exit */ } /* */ hWindow = GetActiveEditWindow(); /* get the active edit window */ hData = DataViewGetObject(hWindow); /* get the data view object */ msg_props = DataViewGetTab(hData); /* get tab data */ if (msg_props["Name"]!="XBRL Fields"){ /* if current field is not xbrl fields */ MessageBox('x',"Function must be run on XBRL Fields table"); /* display error */ return ERROR_EXIT; /* return */ } /* */ hSgml = SGMLCreate(); /* create SGML */ SGMLSetHTMLDTD(hSgml,DTD_DEFAULT_HTML_50); /* set the DTD */ RunMenuFunction("XBRL_VALIDATE"); /* run the XBRL Validation function */ hInfo = GetInfoViewWindow(); /* get the information view */ hLog = InfoViewGetLogObject(hInfo); /* get the log in information view */ num_log = LogGetMessageCount(hLog,LOG_NONE); /* get items in log */ log = LogCreate("XBRL Remove Strings"); /* create log handle */ AddMessage(log,"Removing HTML from StringItemType elements"); /* add log message */ for(ix=0;ix<num_log;ix++){ /* for each log entry */ if(LogGetMessageType(hLog,ix)==LOG_ERROR){ /* if it's an error message */ msg = LogGetMessage(hLog,ix); /* get the message */ msg_props = LogGetMessageProperties(hLog,ix); /* get message properties */ if(IsRegexMatch(msg,MSG_MASK)){ /* is it the error we're looking for */ cpos = FindInString(msg,"Cell "); /* get the position of the word cell */ spos = FindInString(msg," ",cpos+5); /* get the position of the next space */ coords = GetStringSegment(msg,cpos+5,spos-(cpos+5)); /* get the coordinate string */ coordflags = TextToInteger(msg_props["EX"]); /* get the flags */ xycoords[1] = (0xFFFF & coordflags); /* get the row */ xycoords[0] = (0xFFFF & (coordflags >> 16)); /* get the column */ text = DataViewCellGetText(hData,xycoords[0],xycoords[1]); /* get the text of the cell */ AddMessage(log,"Found HTML at %s (Row %d,Col %d)", /* add log message */ coords,xycoords[0]+1,xycoords[1]+1); /* add log message */ AddMessage(log," Old: %s",GetStringSegment(text,0,200)); /* add log message */ wParse = WordParseCreate(WP_SGML_TAG,text); /* create the word parser */ nextword = WordParseGetWord(wParse); /* get the next word */ woutput = ""; /* wstring output */ output = ""; /* reset output */ while(nextword!=""){ /* while we have a next word */ translated=""; /* reset translated to nothing */ if(IsSGMLTag(nextword) == false){ /* if the next word is not an SGML tag */ if(IsSGMLCharacterEntity(nextword) == true){ /* if it's a character entity */ translated[0] = SGMLCharEntityToValue(hSgml, /* translate character entity */ nextword); /* translate character entity */ rc = GetLastError(); /* get the last error */ if(IsError(rc)){ /* if last code was error */ MessageBox('x', /* display error */ "Cannot translate character, error %0x",rc); /* */ return ERROR_EXIT; /* return error */ } /* */ } /* */ else{ /* if not character */ translated = AnsiToUnicode(nextword); /* add next word to output */ } /* */ if(WordParseHasSpace(wParse)){ /* if there was space before nextword */ woutput+= AnsiToUnicode(" "); /* add a space to the output */ } /* */ woutput+= translated; /* add translated to output */ } /* */ nextword = WordParseGetWord(wParse); /* get the next word */ } /* */ output = UnicodeToAnsi(woutput); /* get ANSI output for logging */ AddMessage(log," New: %s",GetStringSegment(output,0,200)); /* add log message */ AddMessage(log,""); /* add spacer to log */ hClipboard = ClipboardCreate(); /* get clipboard */ rc = ClipboardSetUnicode(hClipboard,woutput); /* set clipboard text */ if (IsError(rc)){ /* if it's an error */ MessageBox('x',"Cannot Set Clipboard text, error %0x",rc); /* display error */ ProgressClose(); /* close progress bar */ return ERROR_EXIT; /* return error */ } /* */ CloseHandle(hClipboard); /* close the clipboard handle */ rc = DataViewSetCaret(hData,xycoords[0],xycoords[1]); /* set active cell */ if (IsError(rc)){ /* if it's an error */ MessageBox('x',"Cannot Set Caret Position, error %0x",rc); /* display error */ ProgressClose(); /* close progress bar */ return ERROR_EXIT; /* return error */ } /* */ RunMenuFunction("EDIT_PASTE"); /* paste content to active cell */ fixed++; /* add to number of fixed facts */ } /* */ } /* */ } /* */ AddMessage(log,"Fixed %d facts.",fixed); /* add message */ LogDisplay(log); /* display the log */ return ERROR_EXIT; /* return no error */ } /* */ /****************************************/ int main(){ /* main */ /****************************************/ return setup(); /* run setup */ }
Steven Horowitz has been working for Novaworks for over five years as a technical expert with a focus on EDGAR HTML and XBRL. Since the creation of the Legato language in 2015, Steven has been developing scripts to improve the GoFiler user experience. He is currently working toward a Bachelor of Sciences in Software Engineering at RIT and MCC. |
Additional Resources
Legato Script Developers LinkedIn Group
Primer: An Introduction to Legato