This week we’re revisiting the Align Outline Text script from LDC #41 to show how an existing script can be expanded to add new functionality. The original version of this script made a set of assumptions that hold for our sample test file, but there are many other cases where those assumptions are false and the script does nothing to fix the problem. The original purpose was to examine paragraphs, try to identify the lead-in text (the “(a)”, or “1.”, or “Section 1.”, etc.), and wrap it with a <FONT> tag whose properties cause it to render with a fixed width instead of relying on non-breaking spaces to make it appear spaced. Instead of looking for a very specific type of spacer, as the last script did, we can use the Word Parser in Legato to step through the lead-in word by word, examining each one to determine how many characters appear before the first set of non-breaking spaces. This is a lot more flexible, and lets our script react better to unforeseen styles of paragraphs.
Friday, August 18. 2017
LDC #48: Improved Align Outline Text Function
For an example of the before and after of this script, check out our example here.
Our modified script:
//
//
//      GoFiler Legato Script - Align Outlined Text
//      ------------------------------------------
//
//      Rev     06/30/2017
//              08/16/2017
//
//
//      (c) 2017 Novaworks, LLC -- All rights reserved.
//
//      Examines any HTML file for a paragraph followed by up to 5 words, followed by a tab or 5+ non-breaking
//      spaces. If it finds them, it wraps the initial words in a font tag and deletes the non-breaking
//      spaces.
//

/********************************************************/
/* Global Items                                         */
/* ------------                                         */
/********************************************************/

#define NBSP            "((&nbsp;)|(&NBSP;)|(&#160;)|(&#xA0;))"
#define NBSP_ONLY       "^"+NBSP+"{5,}$"
#define TAB_CHAR        "	"

/* font tags for SM, MD, and LG values defined below */
#define FONT_TAG_SM     "<FONT STYLE=\"display: inline-block; width: 0.5in; float: left; white-space:nowrap\">"
#define FONT_TAG_MD     "<FONT STYLE=\"display: inline-block; width: 1in; float: left; white-space:nowrap\">"
#define FONT_TAG_LG     "<FONT STYLE=\"display: inline-block; width: 1.5in; float: left; white-space:nowrap\">"
#define UNDERLINE_TAG   "<U>"
#define UNDERLINE_CLOSE "</U>"
#define CLOSE_FONT      "</FONT>"

/* threshold is used with CHARS_SM,CHARS_MD,CHARS_LG to */
/* * calculate font tag to use.                         */
#define CHAR_SIZE_THRESHOLD     8       /* threshold for character levels */
#define CHARS_SM                0       /* small */
#define CHARS_MD                1       /* medium */
#define CHARS_LG                2       /* large */
#define MAX_LEAD_IN_WORDS       5       /* max words in a lead-in segment */
#define MIN_NBSP_AFTER_LEAD_IN  5       /* minimum NBSP's after lead in */

int     run             (int f_id, string mode);  /* Call from Hook Processor */
int     setup           ();
string  get_font_tag    (int text_width);
int     get_word_length (string test);
string  replace_in_string (string data, string find, string replace);
boolean test_font_nbsp  (int psx, int psy, handle sgml, handle edit_object);
boolean test_nbsp       (int psx, int psy, handle sgml, handle edit_object);
int     counter;

/****************************************/
int setup() {                                                   /* Called from Application Startup */
/****************************************/
    string      fnScript;                                       /* Us */
    string      item[10];                                       /* Menu Item */
    int         rc;                                             /* Return Code */

                                                                /* ** Add Menu Item */
                                                                /* * Define Function */
    item["Code"] = "EXTENSION_ALIGN_OUTLINE";                   /* Function Code */
    item["MenuText"] = "&Align Outline Text";                   /* Menu Text */
    item["Description"] = "<B>Align Outline Text</B> ";         /* Description (long) */
    item["Description"]+= "\r\rBreaks outline out into aligned blocks.";  /* * description */
                                                                /* * Check for Existing */
    rc = MenuFindFunctionID(item["Code"]);                      /* Look for existing */
    if (IsNotError(rc)) {                                       /* Was already be added */
      return ERROR_NONE;                                        /* Exit */
    }                                                           /* end error */
                                                                /* * Registration */
    rc = MenuAddFunction(item);                                 /* Add the item */
    if (IsError(rc)) {                                          /* Was already be added */
      return ERROR_NONE;                                        /* Exit */
    }                                                           /* end error */
    fnScript = GetScriptFilename();                             /* Get the script filename */
    MenuSetHook(item["Code"], fnScript, "run");                 /* Set the Test Hook */
    return ERROR_NONE;                                          /* Return value (does not matter) */
}                                                               /* end setup */

/****************************************/
int main() {                                                    /* Initialize from Hook Processor */
/****************************************/
    setup();                                                    /* Add to the menu */
    return ERROR_NONE;
}                                                               /* end setup */

/****************************************/
int run(int f_id, string mode) {                                /* Call from Hook Processor */
/****************************************/
    int         px,py;                                          /* start pos of paragraph text */
    int         text_width;                                     /* the width of the lead in text */
    int         ex,ey,sx,sy;                                    /* positional variables */
    boolean     matches;                                        /* if matches a replace pattern */
    dword       type;                                           /* type of window */
    string      font_tag;                                       /* font tag to write */
    string      content;                                        /* content of an SGML tag */
    string      closetag;                                       /* closing tag to write out */
    string      element;                                        /* sgml element */
    handle      sgml;                                           /* sgml object */
    handle      edit_object;                                    /* edit object */
    handle      edit_window;                                    /* edit window handle */
    string      text;                                           /* closing element of sgml object */

    if (mode!="preprocess"){                                    /* if mode is not preprocess */
      return ERROR_NONE;                                        /* return no error */
    }
    counter = 0;                                                /* reset counter from last run */
    edit_window = GetActiveEditWindow();                        /* get handle to edit window */
    if(IsError(edit_window)){                                   /* get active edit window */
      MessageBox('x',"Cannot get edit window.");                /* display error */
      return ERROR_EXIT;                                        /* return */
    }
    type = GetEditWindowType(edit_window) & EDX_TYPE_ID_MASK;   /* get the type of the window */
    if (type!=EDX_TYPE_PSG_PAGE_VIEW && type!=EDX_TYPE_PSG_TEXT_VIEW){  /* and make sure type is HTML or Code */
      MessageBox('x',"This is not an HTML edit window.");       /* display error */
      return ERROR_EXIT;                                        /* return error */
    }
    edit_object = GetEditObject(edit_window);                   /* create edit object */
    sgml = SGMLCreate(edit_object);                             /* create sgml object */
    element = SGMLNextElement(sgml);                            /* get the first sgml element */
    while(element != ""){                                       /* while element isn't empty */
      if (IsError(element)){                                    /* if it couldn't read the element */
        MessageBox('x',"Could not read HTML element, aborting.");  /* print error */
        return ERROR_EXIT;                                      /* return error */
      }
      if (FindInString(element, "<p", 0, false)>(-1)){          /* if the element is a paragraph */
        matches = true;                                         /* reset matches */
        sx = SGMLGetItemPosSX(sgml);                            /* get sgml start */
        sy = SGMLGetItemPosSY(sgml);                            /* get sgml start */
        ex = SGMLGetItemPosEX(sgml);                            /* set sgml end */
        ey = SGMLGetItemPosEY(sgml);                            /* set sgml end */
        switch (matches){                                       /* switch on boolean matches */
          case matches = test_nbsp(sx, sy, sgml, edit_object):  /* try case 2 */
            break;                                              /* end */
          case matches = test_font_nbsp(ex, ey, sgml, edit_object):  /* try case 1 */
            break;                                              /* end */
          default:                                              /* if nothing happened */
            SGMLSetPosition(sgml, ex, ey);                      /* set to end of item */
            break;
        }
      }
      element = SGMLNextElement(sgml);                          /* get the next sgml element */
    }
    CloseHandle(edit_object);                                   /* close edit object */
    MessageBox('i',"Found and modified %d paragraphs.",counter);  /* messagebox */
    return ERROR_NONE;                                          /* Exit Done */
}                                                               /* end setup */

/****************************************/
boolean test_font_nbsp(int psx, int psy, handle sgml, handle edit_object){  /* tests for spacers in font tag */
/****************************************/
    int         px,py;                                          /* start pos of paragraph text */
    int         text_width;                                     /* the width of the lead in text */
    int         ex,ey,sx,sy;                                    /* positional variables */
    string      font_tag;                                       /* font tag to write */
    string      content;                                        /* content of an SGML tag */
    string      closetag;                                       /* closing tag to write out */
    string      element;                                        /* sgml element */
    string      text;                                           /* closing element of sgml object */

    SGMLSetPosition(sgml,psx,psy);                              /* reset to start of paragraph */
    px = psx;                                                   /* store prior start */
    py = psy;                                                   /* store prior end */
    element = SGMLNextElement(sgml);                            /* get the next element */
    while (FindInString(element, "<font", 0, false)<0 &&        /* while not a font tag */
           element!="</p>" && element!=""){                     /* and not at the end of P */
      if (FindInString(element, "<a", 0, false)>(-1)){          /* if next element is an anchor */
        SGMLNextElement(sgml);                                  /* advance 2 times */
        px = SGMLGetItemPosEX(sgml);                            /* get px */
        py = SGMLGetItemPosEY(sgml);                            /* get py */
        element = SGMLNextElement(sgml);                        /* advance 2 times */
      }
      else{
        break;                                                  /* if not an A tag, break */
      }
    }
    if (FindInString(element, "<font", 0, false)>(-1)){         /* if the next element is a font tag */
      sx = SGMLGetItemPosSX(sgml);                              /* start of font tag */
      sy = SGMLGetItemPosSY(sgml);                              /* start of font tag */
      content = ReadSegment(edit_object,px,py,sx,sy);           /* get content of lead-in */
      text_width = GetStringLength(content);                    /* get width of text */
      content = SGMLFindClosingElement(sgml,SP_FCE_CODE_AS_IS); /* get the content of the font tag */
      content = TrimPadding(content);                           /* remove leading / trailing space */
      if (IsRegexMatch(content, NBSP_ONLY)){                    /* check if font tag is only NBSP's */
        SGMLFindClosingElement(sgml);                           /* move to close of tag */
        ex = SGMLGetItemPosEX(sgml);                            /* end of font tag */
        ey = SGMLGetItemPosEY(sgml);                            /* end of font tag */
        font_tag = get_font_tag(text_width);                    /* get font tag to use */
        if (font_tag!=""){                                      /* if we have a font tag to use */
          WriteSegment(edit_object,"",sx,sy,ex,ey);             /* remove font tag */
          WriteSegment(edit_object,CLOSE_FONT,sx,sy);           /* write close font tag */
          WriteSegment(edit_object,font_tag,px,py,px,py);       /* write begin font tag */
          SGMLSetPosition(sgml,px,py);                          /* set SGML position */
          counter++;                                            /* increment count */
          return true;                                          /* return true */
        }
      }
    }
    return false;                                               /* return false */
}

/****************************************/
boolean test_nbsp(int psx, int psy, handle sgml, handle edit_object){  /* test if only NBSP's in after leadin */
/****************************************/
    handle      wp;                                             /* word parser */
    int         lct;                                            /* lead in count */
    int         nct;                                            /* nbsp count */
    int         sx,sy;                                          /* startpoints of paragraph content */
    int         ex,ey;                                          /* endpoints of paragraph content */
    int         text_width;                                     /* width of lead in text */
    boolean     is_tab;                                         /* true if dealing with tab char */
    boolean     ended_lead_in;                                  /* has lead in ended? */
    boolean     last_word_nbsp;                                 /* was last char nbsp? */
    boolean     is_nbsp;                                        /* is this char an nbsp? */
    boolean     underline;                                      /* true if underlining */
    string      u_tag;                                          /* underline tag */
    string      u_close;                                        /* close underline tag */
    string      tag_type;                                       /* type of tag to look for */
    string      font_tag;                                       /* font tag to use */
    string      last_word;                                      /* the last word */
    string      lead_in;                                        /* lead in string */
    string      nbsp_string;                                    /* nbsp_string */
    string      wp_word;                                        /* parsed word */
    string      content;                                        /* content of paragraph */
    string      new_lead_in;                                    /* new lead-in wrapped in font tags */
    string      element;                                        /* next element */

    SGMLSetPosition(sgml,psx,psy);                              /* reset to start of paragraph */
    element = SGMLNextElement(sgml);                            /* get next SGML element */
    sx = SGMLGetItemPosEX(sgml);                                /* get end of P tag */
    sy = SGMLGetItemPosEY(sgml);                                /* get end of P tag */
    content = SGMLFindClosingElement(sgml, SP_FCE_CODE_AS_IS);  /* get content of paragraph */
    ex = SGMLGetItemPosSX(sgml);                                /* get start of close tag */
    ey = SGMLGetItemPosSY(sgml);                                /* get start of close tag */
    wp = WordParseCreate(WP_SGML_TAG,content);                  /* create parser for content of para */
    wp_word = WordParseGetWord(wp);                             /* get next word */
    while(wp_word!=""){                                         /* while we have a next word */
      if (IsRegexMatch(wp_word,NBSP)){                          /* test if nbsp */
        is_nbsp = true;                                         /* its a nbsp */
      }
      else{                                                     /* if it's not an nbsp */
        if (wp_word==TAB_CHAR){                                 /* if it's a tab char */
          is_tab = true;                                        /* remember we've got a tab */
        }
        is_nbsp = false;                                        /* store value */
      }
      if (IsSGMLTag(wp_word)==false){                           /* if it's not an SGML tag */
        if (!ended_lead_in){                                    /* if lead in hasn't ended */
          if (is_tab){                                          /* if it's a tab character */
            nbsp_string = wp_word;                              /* add char to space string */
            nct = 5;                                            /* counts as 5 spaces */
            break;                                              /* stop processing words */
          }
          if (last_word_nbsp == true && is_nbsp == false){      /* if last word was nbsp, but not eoli */
            lead_in = GetStringSegment(lead_in,0,               /* remove extra space from lead-in */
                        GetStringLength(lead_in)-1);            /* * remove extra space */
            lead_in += last_word + wp_word;                     /* add last word and this word to string*/
            text_width+=get_word_length(last_word)+             /* add last_word width to width */
                        get_word_length(wp_word);               /* add wp_word width to width */
            last_word = "";                                     /* reset last word */
            last_word_nbsp = false;                             /* reset last word nbsp */
            lct+=2;                                             /* increment lead in counter by 2 */
          }
          else {
            if (last_word_nbsp == false && is_nbsp == false){   /* if neither word was nbsp */
              lead_in += wp_word+" ";                           /* add word to lead_in */
              text_width+= get_word_length(wp_word);            /* add length of word to word length */
              lct ++;                                           /* increment lead in word count */
            }
          }
          if(last_word_nbsp == true && is_nbsp){                /* if this and last word were nbsps */
            ended_lead_in = true;                               /* lead in is over */
            nbsp_string = last_word + wp_word;                  /* store as start of nbsp string */
            nct = 2;                                            /* non breaking space count is 2 */
            last_word = wp_word;                                /* store word as last word */
          }
          if (last_word_nbsp == false && is_nbsp){              /* if nbsp but last word was not */
            last_word = wp_word;                                /* store last word */
            last_word_nbsp = true;                              /* remember last word is nbsp */
          }
        }
        else{                                                   /* if lead in has ended */
          if (is_nbsp){                                         /* if it's a nbsp */
            nbsp_string += wp_word;                             /* add to nbsp string */
            nct++;                                              /* increment non-breaking space counter */
          }
          else{                                                 /* if lead in is over and not nbsp */
            break;                                              /* break out of loop */
          }
        }
      }
      else{                                                     /* if it is an SGML tag */
        if (!ended_lead_in){                                    /* if still inside lead_in */
          if (FindInString(wp_word,"<U")>(-1)){                 /* if it's an underline */
            underline = true;                                   /* set underlined to true */
          }
        }
      }
      wp_word = WordParseGetWord(wp);                           /* gets the next word */
    }
    if (lct <= MAX_LEAD_IN_WORDS && nct >= MIN_NBSP_AFTER_LEAD_IN){  /* if it meets criteria for lead-in */
      font_tag = get_font_tag(text_width);                      /* get font tag to use */
      lead_in = TrimString(lead_in);                            /* remove leading / trailing spaces */
      if (underline){                                           /* if underlining */
        u_tag = UNDERLINE_TAG;                                  /* underline tag to write out */
        u_close = UNDERLINE_CLOSE;                              /* underline close to write out */
      }
      else{                                                     /* if not underlining */
        u_tag = "";                                             /* blank tag */
        u_close = "";                                           /* blank tag */
      }
      new_lead_in = font_tag + u_tag + lead_in + u_close + CLOSE_FONT;  /* wrap lead in with new font tag */
      content = replace_in_string(content,lead_in,new_lead_in); /* replace lead in */
      if (content==""){                                         /* if nothing was replaced */
        return false;                                           /* return false */
      }
      content = ReplaceInString(content,nbsp_string,"");        /* remove nbsp's */
      WriteSegment(edit_object,content,sx,sy,ex,ey);            /* write segment out */
      counter++;                                                /* increment counter */
      CloseHandle(wp);                                          /* close handle */
      return true;                                              /* return that we did something */
    }
    CloseHandle(wp);                                            /* close handle */
    return false;
}

/****************************************/
string get_font_tag(int text_width){                            /* return appropriate font tag */
/****************************************/
    string      font_tag;                                       /* font tag to return */

    switch (text_width/CHAR_SIZE_THRESHOLD){                    /* switch on width of lead-in text */
      case (CHARS_SM):                                          /* if small */
        font_tag = FONT_TAG_SM;                                 /* use small tag */
        break;                                                  /* break switch */
      case (CHARS_MD):                                          /* if medium */
        font_tag = FONT_TAG_MD;                                 /* use medium font tag */
        break;                                                  /* break switch */
      case (CHARS_LG):                                          /* if large */
        font_tag = FONT_TAG_LG;                                 /* use large font tag */
        break;                                                  /* break */
      default:                                                  /* if none of the above */
        font_tag = "";                                          /* do not set a font tag */
        break;                                                  /* break */
    }
    return font_tag;                                            /* return selected value */
}

/****************************************/
int get_word_length(string test){                               /* gets the length of a word */
/****************************************/
    if (IsSGMLCharacterEntity(test)){                           /* test if it's a character entity */
      return 1;                                                 /* counts as 1 character */
    }
    if (IsSGMLTag(test)){                                       /* if it's an SGML tag */
      return 0;                                                 /* doesn't render, has zero width */
    }                                                           /* return true */
    return GetStringLength(test);                               /* return default length of word */
    return ERROR_NONE;                                          /* Return value (does not matter) */
}

/****************************************/
string replace_in_string(string data, string find, string replace){  /* replace the first occurrence in a str*/
/****************************************/
    int         begin;                                          /* beginning of the replaced segment */
    int         length;                                         /* length of segment to replace */
    string      front;                                          /* front part of my new string */
    string      back;                                           /* back part of my new string */

    length = GetStringLength(find);                             /* length of segment to replace */
    begin = InString(data,GetStringSegment(find,0,1));          /* find start of replace string in data */
    if (begin<0){                                               /* if not found */
      return "";                                                /* return nothing */
    }
    front = GetStringSegment(data,0,begin);                     /* get front of new string */
    back = GetStringSegment(data,begin+length);                 /* get back of new string */
    return front+replace+back;                                  /* return new string */
}
Let’s start with a few utility functions, which are used in the new version of this script. The first of these functions is get_font_tag. This function returns the appropriate FONT tag given the text width.
/****************************************/
string get_font_tag(int text_width){                            /* return appropriate font tag */
/****************************************/
    string      font_tag;                                       /* font tag to return */

    switch (text_width/CHAR_SIZE_THRESHOLD){                    /* switch on width of lead-in text */
      case (CHARS_SM):                                          /* if small */
        font_tag = FONT_TAG_SM;                                 /* use small tag */
        break;                                                  /* break switch */
      case (CHARS_MD):                                          /* if medium */
        font_tag = FONT_TAG_MD;                                 /* use medium font tag */
        break;                                                  /* break switch */
      case (CHARS_LG):                                          /* if large */
        font_tag = FONT_TAG_LG;                                 /* use large font tag */
        break;                                                  /* break */
      default:                                                  /* if none of the above */
        font_tag = "";                                          /* do not set a font tag */
        break;                                                  /* break */
    }
    return font_tag;                                            /* return selected value */
}
A lead-in can vary in size, so we want to make sure the FONT tag has enough spacing in it to accommodate multiple sizes. This function uses a pretty simple switch statement that switches on the width of the lead-in text divided by a defined threshold value. If that integer quotient is one of the expected values (CHARS_SM, CHARS_MD, or CHARS_LG, defined as 0, 1, and 2), it returns the corresponding small, medium, or large font tag defined at the top of the script. With a threshold of 8, that means lead-ins of up to 7 characters get the small tag, 8 to 15 the medium tag, and 16 to 23 the large tag. Otherwise it returns nothing, so no font tag gets written out at all. It’s better to write nothing than to write out something too small to fit what we need. Defines are used here intentionally, so we can tweak things simply by modifying values at the top of the script. While writing this, I changed CHAR_SIZE_THRESHOLD from the initial version’s value of 10 to 8, because I found 10 was slightly too large when testing. Using defines makes changes like this very easy to do.
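The bucketing described above can be modeled outside GoFiler. The following is a minimal sketch in Python rather than Legato, assuming the division in the switch is integer division; the function name pick_width and the WIDTH_BY_BUCKET table are made up for illustration, and only the threshold and the three widths come from the script's defines.

```python
CHAR_SIZE_THRESHOLD = 8  # same value as the script's define

# Widths taken from FONT_TAG_SM, FONT_TAG_MD and FONT_TAG_LG;
# the keys play the role of CHARS_SM, CHARS_MD and CHARS_LG.
WIDTH_BY_BUCKET = {0: "0.5in", 1: "1in", 2: "1.5in"}


def pick_width(text_width: int) -> str:
    """Return the CSS width for a lead-in, or "" if the lead-in is too wide."""
    bucket = text_width // CHAR_SIZE_THRESHOLD  # integer division (assumed)
    return WIDTH_BY_BUCKET.get(bucket, "")


if __name__ == "__main__":
    for width in (3, 8, 15, 16, 23, 24):
        print(width, repr(pick_width(width)))
```

A width of 24 or more falls outside all three buckets and yields an empty string, which corresponds to the script's default case of writing no font tag.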
The next utility function is get_word_length. This function is relatively straightforward: it returns the rendered length of a word.
/****************************************/
int get_word_length(string test){                               /* gets the length of a word */
/****************************************/
    if (IsSGMLCharacterEntity(test)){                           /* test if it's a character entity */
      return 1;                                                 /* counts as 1 character */
    }
    if (IsSGMLTag(test)){                                       /* if it's an SGML tag */
      return 0;                                                 /* doesn't render, has zero width */
    }                                                           /* return true */
    return GetStringLength(test);                               /* return default length of word */
}
Just using GetStringLength is not accurate, because HTML tags like <B> do not affect the length of the rendered text, and character entities are multiple characters in code but only a single character when rendered by the browser.
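To make the difference between code length and rendered length concrete, here is a small Python model of the same rule. It is a sketch, not the Legato implementation: the two regular expressions are simplified stand-ins I am assuming for IsSGMLCharacterEntity and IsSGMLTag, and rendered_length is a hypothetical name.

```python
import re

# Simplified stand-ins (assumptions) for Legato's entity and tag checks.
ENTITY = re.compile(r"&(#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def rendered_length(word: str) -> int:
    """Approximate how many characters a parsed word occupies when rendered."""
    if ENTITY.fullmatch(word):
        return 1  # an entity renders as a single character
    if TAG.fullmatch(word):
        return 0  # a tag does not render at all
    return len(word)  # plain text: one character per character


if __name__ == "__main__":
    for word in ("Section", "&nbsp;", "&#160;", "<B>", "</B>"):
        print(word, len(word), rendered_length(word))
```

The middle column printed by the loop is what a plain string length would report, which shows how far off it can be for entities and tags.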
/****************************************/
string replace_in_string(string data, string find, string replace){  /* replace the first occurrence in a str*/
/****************************************/
    int         begin;                                          /* beginning of the replaced segment */
    int         length;                                         /* length of segment to replace */
    string      front;                                          /* front part of my new string */
    string      back;                                           /* back part of my new string */

    length = GetStringLength(find);                             /* length of segment to replace */
    begin = InString(data,GetStringSegment(find,0,1));          /* find start of replace string in data */
    if (begin<0){                                               /* if not found */
      return "";                                                /* return nothing */
    }
    front = GetStringSegment(data,0,begin);                     /* get front of new string */
    back = GetStringSegment(data,begin+length);                 /* get back of new string */
    return front+replace+back;                                  /* return new string */
}
The replace_in_string function is used instead of the normal ReplaceInString function in Legato because we only want to replace the first occurrence of the find variable instead of every occurrence. The function starts by getting the length of the string we’re replacing with the GetStringLength function. Using InString, we then get the position of the first character of our find string. In this specific case, the first occurrence of that character should always be the start of the string we’re looking for, so this works. Initially this script was written using InString with the full find string, but that doesn’t work if there are returns in the lead-in, which is why only the first character is searched for. If it’s not found (which shouldn’t be the case) we return an empty string and end. Otherwise, we get the substring from the beginning of our data string up to the beginning of the replaced segment and store it as front. Then we get the segment that starts at the found position plus the length of our find string and runs to the end of our data string, and store the result in back. The variables front and back are now the text before and after find, respectively. Concatenating front, the replace string, and back creates a new string that is the old one with the first occurrence of find replaced.
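The front/back splice can be sketched in a few lines of Python. Note one deliberate difference, which is my simplification and not the script's behavior: this model searches for the whole find string, whereas the script searches only for its first character. The name replace_first is hypothetical.

```python
def replace_first(data: str, find: str, replacement: str) -> str:
    """Replace only the first occurrence of find; return "" if it is absent."""
    begin = data.find(find)  # whole-string search (differs from the script)
    if begin < 0:
        return ""  # mirrors the script: empty string signals "not found"
    front = data[:begin]  # text before the match
    back = data[begin + len(find):]  # text after the match
    return front + replacement + back


if __name__ == "__main__":
    print(replace_first("(a) see item (a) above", "(a)", "<FONT>(a)</FONT>"))
```

Only the leading "(a)" is wrapped; the later reference to "(a)" inside the paragraph text is left alone, which is the reason for not replacing every occurrence.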
Now that we’ve covered the utility functions, let’s discuss the changes to how the script works, starting with the run function.
/****************************************/
int run(int f_id, string mode) {                                /* Call from Hook Processor */
/****************************************/

    ....omitted code.....

      if (FindInString(element, "<p", 0, false)>(-1)){          /* if the element is a paragraph */
        matches = true;                                         /* reset matches */
        sx = SGMLGetItemPosSX(sgml);                            /* get sgml start */
        sy = SGMLGetItemPosSY(sgml);                            /* get sgml start */
        ex = SGMLGetItemPosEX(sgml);                            /* set sgml end */
        ey = SGMLGetItemPosEY(sgml);                            /* set sgml end */
        switch (matches){                                       /* switch on boolean matches */
          case matches = test_nbsp(sx, sy, sgml, edit_object):  /* try case 2 */
            break;                                              /* end */
          case matches = test_font_nbsp(ex, ey, sgml, edit_object):  /* try case 1 */
            break;                                              /* end */
          default:                                              /* if nothing happened */
            SGMLSetPosition(sgml, ex, ey);                      /* set to end of item */
            break;
        }
      }
      element = SGMLNextElement(sgml);                          /* get the next sgml element */
    }
    CloseHandle(edit_object);                                   /* close edit object */
    MessageBox('i',"Found and modified %d paragraphs.",counter);  /* messagebox */
    return ERROR_NONE;                                          /* Exit Done */
}                                                               /* end setup */
I’m abbreviating the run function here because the first portion of it is discussed in depth in LDC #41. The new portion occurs inside the if statement that is true if the element we’re parsing is a paragraph tag. If it’s a paragraph, we set matches to true, get our start and end positions for the element, and switch on matches. Each case in this switch is a different test function. Legato tests each case statement in order against the value of matches. Because matches starts out true, the switch stops at the first case whose test function succeeds; if a test fails, Legato tries the next case. Our tests are going to move the SGML parser, since they will modify the text if successful. Because of this, we need to pass a position to each test so the test can first reset the parse position. The test_nbsp function is the new and improved test for this round of the script. The test_font_nbsp function contains the same logic as the previous blog post, and is included to ensure we don’t remove any functionality. New tests can be easily added by modifying this switch. After all test cases are run, we get the next element and keep going. When all elements in the document have been examined, the edit_object is closed, we report back how many paragraphs were modified, and return without error.
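The "try each test until one succeeds" pattern that the switch implements can be shown in isolation. This is a Python sketch of the control flow only; the two test functions are hypothetical placeholders, not ports of test_nbsp or test_font_nbsp.

```python
from typing import Callable, List


def first_match(paragraph: str, tests: List[Callable[[str], bool]]) -> bool:
    """Run each test in order and stop at the first one that succeeds."""
    for test in tests:
        if test(paragraph):
            return True
    return False  # corresponds to the switch's default case


calls = []  # records which hypothetical tests actually ran


def has_tab(paragraph: str) -> bool:
    calls.append("has_tab")
    return "\t" in paragraph


def has_nbsp_run(paragraph: str) -> bool:
    calls.append("has_nbsp_run")
    return "&nbsp;" * 5 in paragraph


matched = first_match("(a)\tSome text", [has_tab, has_nbsp_run])

if __name__ == "__main__":
    print(matched, calls)
```

Because the first test succeeds, the second is never called. Adding a new test is a matter of appending another function to the list, just as a new case would be added to the switch.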
Let’s look at the new test case:
/****************************************/
boolean test_nbsp(int psx, int psy, handle sgml, handle edit_object){  /* test if only NBSP's in after leadin */
/****************************************/

    .... omitted instant variable declarations ....

    SGMLSetPosition(sgml,psx,psy);                              /* reset to start of paragraph */
    element = SGMLNextElement(sgml);                            /* get next SGML element */
    sx = SGMLGetItemPosEX(sgml);                                /* get end of P tag */
    sy = SGMLGetItemPosEY(sgml);                                /* get end of P tag */
    content = SGMLFindClosingElement(sgml, SP_FCE_CODE_AS_IS);  /* get content of paragraph */
    ex = SGMLGetItemPosSX(sgml);                                /* get start of close tag */
    ey = SGMLGetItemPosSY(sgml);                                /* get start of close tag */
    wp = WordParseCreate(WP_SGML_TAG,content);                  /* create parser for content of para */
    wp_word = WordParseGetWord(wp);                             /* get next word */
    while(wp_word!=""){                                         /* while we have a next word */
      if (IsRegexMatch(wp_word,NBSP)){                          /* test if nbsp */
        is_nbsp = true;                                         /* its a nbsp */
      }
      else{                                                     /* if it's not an nbsp */
        if (wp_word==TAB_CHAR){                                 /* if it's a tab char */
          is_tab = true;                                        /* remember we've got a tab */
        }
        is_nbsp = false;                                        /* store value */
      }
The primary new function is test_nbsp. It takes a start X and start Y value, the SGML parser handle, and the Edit Object handle as inputs, and will try to modify the outline as required. The first thing it does is set the parse position to the starting position passed to the function. Then it gets the next SGML tag, which should be a paragraph. Then it gets the content of that tag by using the SGMLFindClosingElement function to get all the content between the beginning and the ending of the block. We can use a Word Parse object to iterate over that content, word by word. The function WordParseCreate gets a handle to the word parser. With the WP_SGML_TAG flag set, the parser stops on tags and character entities, treating each one as a whole word. While we have a word, we can run a series of if/else statements to determine the state of the paragraph, and which part is the lead-in and which part is the spacer block. First, we use IsRegexMatch to test if the word is a non-breaking space. If it is, we set the is_nbsp flag to true. Otherwise, we set is_nbsp to false and test for a tab character.
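For readers without GoFiler at hand, the effect of parsing with tags and entities treated as whole words can be approximated with a regular expression. This Python sketch is only my assumption of how such a parser splits its input; the real Word Parser may differ in details, and tokenize is a made-up name.

```python
import re

# Order matters: try a tag, then an entity, then a tab, then a plain word.
# A lone "&" that does not start an entity is emitted as its own token.
TOKEN = re.compile(r"<[^>]+>|&#?\w+;|\t|[^\s<&]+|&")


def tokenize(content: str) -> list:
    """Split paragraph content into tags, entities, tabs and plain words."""
    return TOKEN.findall(content)


if __name__ == "__main__":
    sample = "<U>Section 1.</U>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Definitions"
    for token in tokenize(sample):
        print(repr(token))
```

Each non-breaking space arrives as its own token, so the loop can count them one at a time, and the underline tags arrive whole so they can be recognized and skipped.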
      if (IsSGMLTag(wp_word)==false){                           /* if it's not an SGML tag */
        if (!ended_lead_in){                                    /* if lead in hasn't ended */
          if (is_tab){                                          /* if it's a tab character */
            nbsp_string = wp_word;                              /* add char to space string */
            nct = 5;                                            /* counts as 5 spaces */
            break;                                              /* stop processing words */
          }
If the word from the parser is an SGML tag, we skip down to the tag-handling branch covered further below. Otherwise, we need to check if our lead-in has ended yet. If it has, we jump to the spacer-counting branch, also covered below. If not, we need to check if the word is a tab. If it’s a tab, we set the spacer string nbsp_string to the tab character and set the number of non-breaking spaces (nct) to 5, then break out of our loop because we’re done processing the lead-in and spacers.
          if (last_word_nbsp == true && is_nbsp == false){      /* if last word was nbsp, but not eoli */
            lead_in = GetStringSegment(lead_in,0,               /* remove extra space from lead-in */
                        GetStringLength(lead_in)-1);            /* * remove extra space */
            lead_in += last_word + wp_word;                     /* add last word and this word to string*/
            text_width+=get_word_length(last_word)+             /* add last_word width to width */
                        get_word_length(wp_word);               /* add wp_word width to width */
            last_word = "";                                     /* reset last word */
            last_word_nbsp = false;                             /* reset last word nbsp */
            lct+=2;                                             /* increment lead in counter by 2 */
          }
If we’re still processing the lead-in, we can check if the last word was a non-breaking space. If it was, and the current word is NOT a non-breaking space, we need to add both words to the lead-in text string we’re building. The last word was a non-breaking space, so we can remove the last character of the lead-in, which should just be an extra spacer. We can then add the last word and the current word to our lead-in, and add both word lengths (using our get_word_length subroutine) to our text_width variable. Then we can reset last_word to blank and last_word_nbsp to false, and increment the number of words in the lead-in (lct) by 2, because we just added 2 words to it. This logic is required in case the lead-in text has a non-breaking space between two words.
          else {
            if (last_word_nbsp == false && is_nbsp == false){   /* if neither word was nbsp */
              lead_in += wp_word+" ";                           /* add word to lead_in */
              text_width+= get_word_length(wp_word);            /* add length of word to word length */
              lct ++;                                           /* increment lead in word count */
            }
          }
If neither this word nor the last word were non-breaking spaces, we can simply append the word to our lead-in string we’re building along with a blank space, and increment the text_width by the length of our word, and increment the lead_in word counter (lct) by 1.
          if(last_word_nbsp == true && is_nbsp){                /* if this and last word were nbsps */
            ended_lead_in = true;                               /* lead in is over */
            nbsp_string = last_word + wp_word;                  /* store as start of nbsp string */
            nct = 2;                                            /* non breaking space count is 2 */
            last_word = wp_word;                                /* store word as last word */
          }
Next, we test if both this word and the last word were non-breaking spaces. If so, then our lead-in is over, so we set ended_lead_in to true, start our nbsp_string with the two non-breaking spaces, set the number of non-breaking spaces to 2, and store the current word as the last word.
          if (last_word_nbsp == false && is_nbsp){              /* if nbsp but last word was not */
            last_word = wp_word;                                /* store last word */
            last_word_nbsp = true;                              /* remember last word is nbsp */
          }
        }
If the last word is not a non-breaking space, but the current one is, then we simply store the current word as the last word and remember that the last word is a non-breaking space, so it can be added to either the nbsp_string or the lead_in string on the next loop around.
        else{                                                   /* if lead in has ended */
          if (is_nbsp){                                         /* if it's a nbsp */
            nbsp_string += wp_word;                             /* add to nbsp string */
            nct++;                                              /* increment non-breaking space counter */
          }
          else{                                                 /* if lead in is over and not nbsp */
            break;                                              /* break out of loop */
          }
        }
      }
Now that our lead-in has ended, we can test if the current word is a non-breaking space. If it is, we add it to our nbsp_string, and increment the counter of non-breaking spaces. Otherwise, we can just break and end the loop now, because our spacer string is over and we’ve begun actual contents of the paragraph.
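The branches covered so far amount to a small state machine, which can be easier to follow when written compactly. The Python below is a simplified model under my own assumptions (it takes an already tokenized list, tracks a single pending non-breaking space instead of the last_word and last_word_nbsp pair, and skips tags without looking at them); split_lead_in is a hypothetical name, not part of the script.

```python
import re

NBSP = re.compile(r"&nbsp;|&NBSP;|&#160;|&#[xX][aA]0;")


def split_lead_in(tokens):
    """Return (lead-in tokens, spacer count) for a tokenized paragraph."""
    lead_in = []    # words of the lead-in, including embedded single nbsps
    pending = None  # one nbsp whose role is not known yet
    spacers = 0     # plays the role of nct
    ended = False   # plays the role of ended_lead_in
    for token in tokens:
        if token.startswith("<"):
            continue  # tags take no part in the word or spacer counts
        is_nbsp = bool(NBSP.fullmatch(token))
        if ended:
            if not is_nbsp:
                break  # spacer run is over, real paragraph text begins
            spacers += 1
        elif token == "\t":
            spacers = 5  # a tab counts as five spaces
            break
        elif is_nbsp and pending is not None:
            ended = True  # two nbsps in a row end the lead-in
            spacers = 2
            pending = None
        elif is_nbsp:
            pending = token  # might be a separator inside the lead-in
        else:
            if pending is not None:
                lead_in.append(pending)  # it was a separator after all
                pending = None
            lead_in.append(token)
    return lead_in, spacers


if __name__ == "__main__":
    print(split_lead_in(["<U>", "Section", "1.", "</U>"] + ["&nbsp;"] * 5 + ["Text"]))
```

As in the script, a single non-breaking space between two words is counted as part of the lead-in, while two in a row mark the start of the spacer block.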
      else{                                                     /* if it is an SGML tag */
        if (!ended_lead_in){                                    /* if still inside lead_in */
          if (FindInString(wp_word,"<U")>(-1)){                 /* if it's an underline */
            underline = true;                                   /* set underlined to true */
          }
        }
      }
      wp_word = WordParseGetWord(wp);                           /* gets the next word */
    }
If we have an SGML tag instead of a normal word or character entity, we don’t want to add it to our lead_in string. Instead, we just need to check if the lead-in is still being built. If it is, we need to check if the tag is a U (underline) tag. If it is, we flag the lead-in as having the underline property, so when we write out our font tags we can wrap the lead-in with underline tags of our own. The font tag we’re using has a style of display: inline-block, so it will block out the underline style unless we do this.
    if (lct <= MAX_LEAD_IN_WORDS && nct >= MIN_NBSP_AFTER_LEAD_IN){  /* if it meets criteria for lead-in */
      font_tag = get_font_tag(text_width);                      /* get font tag to use */
      lead_in = TrimString(lead_in);                            /* remove leading / trailing spaces */
      if (underline){                                           /* if underlining */
        u_tag = UNDERLINE_TAG;                                  /* underline tag to write out */
        u_close = UNDERLINE_CLOSE;                              /* underline close to write out */
      }
      else{                                                     /* if not underlining */
        u_tag = "";                                             /* blank tag */
        u_close = "";                                           /* blank tag */
      }
      new_lead_in = font_tag + u_tag + lead_in + u_close + CLOSE_FONT;  /* wrap lead in with new font tag */
      content = replace_in_string(content,lead_in,new_lead_in); /* replace lead in */
      if (content==""){                                         /* if nothing was replaced */
        return false;                                           /* return false */
      }
      content = ReplaceInString(content,nbsp_string,"");        /* remove nbsp's */
      WriteSegment(edit_object,content,sx,sy,ex,ey);            /* write segment out */
      counter++;                                                /* increment counter */
      CloseHandle(wp);                                          /* close handle */
      return true;                                              /* return that we did something */
    }
    CloseHandle(wp);                                            /* close handle */
    return false;
}
Now that we’re done testing things in our paragraph, and we have built our lead-in string and the spacer string that follows it, we can check them against our threshold defines to see if this actually counts as a paragraph we want to alter. If the lead-in string has no more than MAX_LEAD_IN_WORDS words (I set it to 5) and there are at least MIN_NBSP_AFTER_LEAD_IN non-breaking spaces after it (also set to 5), then it counts as an outline paragraph with a heading we want to modify. We get the font tag we are going to use with the get_font_tag function, trim our lead-in of leading or trailing whitespace, and then test if underline is true or not. If so, we’re going to use the defines UNDERLINE_TAG and UNDERLINE_CLOSE for our underline tags. Otherwise, they are left blank. We can then build our new lead-in string by adding the font tag, the open underline, the existing lead-in, the close underline, and the close font. Once we have our new lead-in, we can use our replace_in_string function to replace the old lead-in with the new one, and use the ReplaceInString function to replace our spacer string with nothing, because it’s been replaced by our font tag spacer. Using WriteSegment, we can write the result back out to the edit window, close our handle to the Word Parser, and then return true, because we did change things. If the paragraph didn’t meet our thresholds, we can just close the Word Parser and return false, because nothing changed.
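The final decision and the way the new lead-in is assembled can be summarized in a short Python model. The constants are copied from the script's defines; the function names qualifies and wrap_lead_in are hypothetical, and the sketch builds the string only, without touching an edit window.

```python
MAX_LEAD_IN_WORDS = 5        # same values as the script's defines
MIN_NBSP_AFTER_LEAD_IN = 5
CLOSE_FONT = "</FONT>"
FONT_TAG_SM = ('<FONT STYLE="display: inline-block; width: 0.5in; '
               'float: left; white-space:nowrap">')


def qualifies(word_count: int, spacer_count: int) -> bool:
    """True when the paragraph looks like an outline paragraph to rewrite."""
    return (word_count <= MAX_LEAD_IN_WORDS
            and spacer_count >= MIN_NBSP_AFTER_LEAD_IN)


def wrap_lead_in(lead_in: str, font_tag: str, underline: bool) -> str:
    """Wrap the trimmed lead-in in the font tag, re-adding underline if set."""
    u_open, u_close = ("<U>", "</U>") if underline else ("", "")
    return font_tag + u_open + lead_in.strip() + u_close + CLOSE_FONT


if __name__ == "__main__":
    if qualifies(word_count=2, spacer_count=5):
        print(wrap_lead_in("Section 1. ", FONT_TAG_SM, underline=True))
```

The underline tags go inside the font tag, since the inline-block style on the font tag would otherwise hide an underline applied from outside it.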
This script can definitely be improved even further. It could really use a UI to edit the defined values at the top of the script, instead of having to manually edit the script file, which is beyond most users. Right now, it only detects underlines as a potential style being lost, when it should look for any font tags with the text-decoration property and try to duplicate that value. It also needs to be compliant with more styles of HTML; it should probably write out the font tags based on the DTD of the selected document rather than the defined values it’s using now. None of these changes would be simple, however, and as always we need to weigh how long it would take to implement them against how much we would get out of them. Spending a week developing a script to save five minutes of work is never a good idea.
Steven Horowitz has been working for Novaworks for over five years as a technical expert with a focus on EDGAR HTML and XBRL. Since the creation of the Legato language in 2015, Steven has been developing scripts to improve the GoFiler user experience. He is currently working toward a Bachelor of Sciences in Software Engineering at RIT and MCC.