Legato

GoFiler Legato Script Reference

 

Legato v 1.5e

Application v 5.25b

  

 

Chapter Eleven: SGML Functions

11.1 Introduction to SGML Support

11.1.1 General

Legato provides a couple of levels of Standard Generalized Markup Language (SGML) support. The SGML umbrella covers HTML, XML, XBRL and a number of other SGML-style formats.

SGML support is broken into an SGML Object for reading and writing SGML data, an HTML Writer Object for easily generating HTML code, an HTML Table Object and an HTML Outline Object.

11.1.2 Terminology

Within the world of SGML, HTML, XML and Legato, there are a number of common terms to be familiar with:

tag: A collection of an element and zero or more attribute-value pairs surrounded by ‘<’ and ‘>’ angle brackets. For example:

<P>

Specifies the start of a paragraph block in HTML.

element: The element is the text that defines the tag. In the above example the element is ‘P’. In HTML, elements are not case-sensitive; in XML, they are. Elements can be referenced by string or token.

attribute: Attributes specify zero or more data parameters associated with an element. For example:

<P ALIGN="CENTER">

Specifies an alignment of center. SGML attributes are always in the form of name=data except for certain boolean attributes such as NOWRAP which can have no value or be explicitly NOWRAP=NOWRAP. Attributes are separated from the leading element and other attributes by white space.

Values are normally quoted with single or double straight quotes. If the data contains spaces it must be quoted. By default, the SGML functions will always write data with quotes.

When an attribute is referenced, it is known as a parameter. Within this documentation, attributes are always upper case.
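The attribute forms described above (quoted name=data pairs and valueless boolean attributes) can be seen with a small sketch. It uses Python's standard `html.parser` module purely as an illustration; it is not the Legato SGML Object, and the `TagDump` class and `dump_tags` helper are hypothetical names.

```python
from html.parser import HTMLParser


class TagDump(HTMLParser):
    """Collect (element, attributes) for every start tag encountered."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases element and attribute names; upper-case them
        # here to match this documentation's convention.
        self.tags.append((tag.upper(), [(n.upper(), v) for n, v in attrs]))


def dump_tags(markup):
    """Return the start tags found in `markup` with their attributes."""
    parser = TagDump()
    parser.feed(markup)
    parser.close()
    return parser.tags


tags = dump_tags('<P ALIGN="CENTER"><TD NOWRAP><TD NOWRAP=NOWRAP>')
for element, attributes in tags:
    # A boolean attribute written without a value is reported with None.
    print(element, attributes)
```

In this sketch the bare NOWRAP attribute comes back with no value, while NOWRAP=NOWRAP comes back with the explicit value, which mirrors the two boolean forms described above.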

property: Because HTML has CSS interlaced, the SGML functions process CSS properties in the same manner as attributes. An alternative to the above example:

<P STYLE="text-align: center">

In this case, the STYLE attribute contains one or more CSS property: value pairs. Property-value pairs are separated by a semicolon. Data can be further quoted as required within the CSS value. Property names are not case-sensitive.

When a CSS property is referenced, it is also known as a parameter. Within this documentation, properties are always lower case.

Tokens are generally used to reference parameters, with attributes and properties being of different classes to avoid confusion. If referencing by string name, conflicts can arise. For example, HTML has an attribute known as ‘COLOR’ and CSS has a property known as ‘color’.
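As a rough illustration of how a STYLE value breaks down into property: value pairs, the following sketch splits on semicolons and colons. The `split_style` helper is hypothetical and deliberately simplified: it does not handle a semicolon or colon inside a quoted CSS value, which a real parser must.

```python
def split_style(style):
    """Split a STYLE attribute value into a {property: value} dictionary.

    Simplification: quoted CSS values containing ';' are not handled.
    """
    properties = {}
    for pair in style.split(";"):
        if ":" not in pair:
            continue  # skip empty segments such as the one after a trailing ';'
        name, value = pair.split(":", 1)
        # Property names are not case-sensitive, so normalise to lower case.
        properties[name.strip().lower()] = value.strip()
    return properties


style = split_style("text-align: center; FONT-SIZE: 10pt;")
print(style)
```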

PCDATA: Parsed Character Data (PCDATA) is generally the information between the tags. Certain characters must be represented as character entities. For example, the chevron characters < and > must be represented as the &lt; and &gt; character entities when they are meant as text within the document rather than as tag delimiters. Similarly, the & character must be represented as &amp; to avoid being confused with a badly formatted character entity.

field: A field is a proprietary special HTML comment structured as high-level control data for publishing and document control. Fields are structured as a combination HTML/CSS element. They are also structured and nested like elements.

Review the W3C and other sources for more information and terms.
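The character-entity rule given for PCDATA in the list above can be sketched with Python's standard `html` module. This is only an illustration of the escaping rule, not a Legato function; `to_pcdata` is a hypothetical helper name.

```python
from html import escape, unescape


def to_pcdata(text):
    """Replace &, < and > with character entities so text is not read as markup."""
    # quote=False leaves quotation marks alone; only &, < and > are replaced.
    return escape(text, quote=False)


source = "if a < b && b > c"
pcdata = to_pcdata(source)
print(pcdata)

# Converting the entities back recovers the original text.
assert unescape(pcdata) == source
```

The ampersand has to be replaced before the chevrons, otherwise the & in a freshly written &lt; would itself be escaped; `html.escape` handles that ordering internally.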

11.1.3 Namespace

Another aspect of XML and XHTML is the use of namespaces. Namespaces effectively classify element and attribute names, allowing for segmentation and grouping. While the underlying SGML class has namespace support, it is not exposed in version 1 of Legato. For text and HTML processing, namespace processing is not required.

11.1.4 SGML Object

A central focus of HTML and SGML functionality and support is the SGML Object. It is an underlying class used throughout the supporting application to process HTML and XML data.

11.1.5 Special Data Types

The SGML and related objects and functions use a couple of predefined SDK data types to aid in documentation and programming clarity. These are PVALUE and TOKEN, for parameter value and token value. Each is defined as a dword (a 32-bit unsigned value). When specified as a data type they will be italic and when used in general discussion they will be plain text.

The SGML Object and element class are programmed to support HTML and CSS data types. Therefore, if an attribute is equal to ‘12%’ or a property is set to ‘0.125in’, the string value is translated automatically into a more useful pvalue format. Values can be stored as simple strings, but this significantly reduces the utility of the data. The design is meant to represent data in standard HTML and CSS units with a practical level of precision. For most values, this is 100ths. For one unit type, inches, the resolution is 10000ths. For document publishing, 0.01mm is a very fine resolution, while 10000ths is needed for inches since representing 1/8 of an inch (0.125in) requires more than two decimal places.
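The precision argument can be checked with a little arithmetic. The sketch below scales a decimal string to an integer at a given number of decimal places and reports when the value does not fit exactly. The `to_scaled` helper is hypothetical and only illustrates the idea of fixed-point storage; it is not how the SGML Object is implemented.

```python
from decimal import Decimal


def to_scaled(text, places):
    """Return `text` as an integer scaled by 10**places.

    Returns None when the value cannot be represented exactly at that
    resolution.
    """
    scaled = Decimal(text) * (10 ** places)
    if scaled != scaled.to_integral_value():
        return None  # digits would be lost at this resolution
    return int(scaled)


print(to_scaled("12", 2))      # a percentage such as 12% held in 100ths
print(to_scaled("0.125", 2))   # 1/8 inch does not fit in 100ths
print(to_scaled("0.125", 4))   # 1/8 inch fits in 10000ths
```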

11.1.6 PVALUE Data Type

The PVALUE type is a useful structure for quickly and easily representing complex HTML and SGML data types. A pvalue is a 32-bit value structured in a manner for easy storage and transport. The keyword PVALUE is defined as a dword data type.

The data type considers CSS measurements, HTML and CSS keywords, strings (CDATA) and other information. Internally, pvalues are a bitwise arrangement of a Parameter Type (PT_) and parameter data. The top 5 bits (PT_MASK or 0xF8000000) specify the type of data being represented, such as “inches” or “degrees”. The lower 27 bits (PT_VALUE_MASK or 0x07FFFFFF) represent the data, whose structure depends on the type of data.

Six classes of data are represented: strings, arrays, errors, measurements, colors and keywords. The top 5 bits of the dword (0xF8000000 or PT_MASK) determine the type of data contained in the pvalue. The remaining bits hold a value which can contain flags and other information. For simple measurements, the lower 27-bit portion contains a signed integer in 100ths or 10000ths precision.
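The split between the type bits and the data bits can be sketched as follows. The constant values are copied from the SDK table in this section; the `make_pvalue` and `split_pvalue` helpers are hypothetical and handle non-negative values only.

```python
PT_MASK = 0xF8000000        # top 5 bits: parameter type
PT_VALUE_MASK = 0x07FFFFFF  # lower 27 bits: parameter data
PT_PERCENT = 0x18000000     # percentage, stored in 100ths


def make_pvalue(ptype, value):
    """Combine a parameter type with a non-negative data value."""
    return (ptype & PT_MASK) | (value & PT_VALUE_MASK)


def split_pvalue(pvalue):
    """Return the (type bits, data bits) of a pvalue."""
    return pvalue & PT_MASK, pvalue & PT_VALUE_MASK


pv = make_pvalue(PT_PERCENT, 4300)  # 43.00% held as 4300 hundredths
ptype, value = split_pvalue(pv)
print(hex(pv), ptype == PT_PERCENT, value / 100)
```

In a script, this kind of bit work is normally unnecessary because the SGMLGetParameter and SGMLSetParameter style functions handle the data; the sketch only makes the layout concrete.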

When used in the SGML Object, strings, arrays and error data are stored on the element class heap. In such a case, the lower portion is a heap offset. The script does not have direct access to the heap but rather uses the high-level SGMLGetParameter and SGMLSetParameter style functions to access the data. The text of errors and strings is managed in the same manner, with a heap offset to a zero-terminated text string. However, errors also contain Simple Error Type codes as a way of representing an error with a simple classification.

For most measurement types, the lower data is an integer with the decimal place shifted 2 positions to allow for a resolution of 100ths. For inches, the resolution is set to 10000ths with the decimal shifted 4 places. (There is also a version of inches in 100ths, but a measurement such as 1/8 of an inch cannot be represented exactly in 100ths.)
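A signed measurement can be sketched with the sign constants listed in the SDK table in this section. This sketch assumes negative values are held in two's-complement form within the lower 27 bits, which the "OR to Extend Data Sign" description of PT_SIGN_EXTEND suggests but does not state outright. `encode_measure` and `decode_measure` are hypothetical helpers.

```python
PT_MASK = 0xF8000000
PT_VALUE_MASK = 0x07FFFFFF
PT_SIGN_BIT = 0x04000000     # set when the 27-bit value is negative
PT_SIGN_EXTEND = 0xF8000000  # OR'd in to extend the sign to 32 bits
PT_IN = 0x68000000           # inches, 10000ths
PT_MM = 0x10000000           # millimeters, 100ths


def encode_measure(ptype, scaled):
    """Pack a scaled integer (assumed two's complement) under a type."""
    return ptype | (scaled & PT_VALUE_MASK)


def decode_measure(pvalue):
    """Return (type bits, signed scaled integer) for a measurement pvalue."""
    value = pvalue & PT_VALUE_MASK
    if value & PT_SIGN_BIT:
        value |= PT_SIGN_EXTEND  # now a 32-bit two's-complement pattern
        value -= 1 << 32         # reinterpret that pattern as a negative int
    return pvalue & PT_MASK, value


eighth_inch = encode_measure(PT_IN, -1250)  # -0.1250in in 10000ths
ptype, scaled = decode_measure(eighth_inch)
print(ptype == PT_IN, scaled / 10000)
```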

The SDK values are defined as follows:

  Defined Data   Value   Description
  Parameter Control        
  Masks        
    PT_MASK   0xF8000000   Parameter Type Mask — Bitwise AND with this value reveals the underlying parameter type.
    PT_VALUE_MASK   0x07FFFFFF   Value Mask — Bitwise AND expresses the associated value.
    PT_HEAP_MASK   0x0000FFFF   Mask for Data on Heap — For values that are strings or arrays, the bitwise AND reveals the heap offset.
    PT_KEYWORD_MASK   0x0000FFFF   Ordinal Value Mask for Keyword
  Signed Numbers      
    PT_SIGN_BIT   0x04000000   Sign, Sign Extend Bit, Data Type — If set the value is a negative number.
    PT_SIGN_EXTEND   0xF8000000   OR to Extend Data Sign
  Non-Value Conditions      
    PT_IMPLIED   0xFFFFFFFF   Value is Implied (default)
    PT_MIXED   0xFFFFFFFE   Mixed Condition (multiple items)
    PT_UNTRANSLATED   0xFFFFFFFD   Value Expected to be Translated — The value required action or failed on translation. This value can be returned by failed math operations.
    PT_STRING   0xF8000000   Offset to String on Heap
    PT_STRING_SIZE   0x07FF0000   Size of Item on Heap (must be shifted)
    PT_ARRAY   0xE8000000   Offset to Array Data on Heap
    PT_ARRAY_COMMA   0x02000000   If Set, Array Entries Comma Delimited
    PT_ARRAY_COUNT   0x01FF0000   Mask to Count of PT_ on Heap
  Errors (on heap)      
  Error Control        
    PT_ERROR   0xD8000000   Error Data on Heap (Error : String)
    PT_ERROR_MASK   0x07FF0000   Mask for Error Type
    PT_ERROR_NO_DETAIL   0x0000FFFF   No Offset for Detail Error String
  Simple Error Type Codes      
    PT_ERROR_NONE   0x00000000   No Error in Value
    PT_ERROR_SYNTAX   0x00010000   Item Fails on Syntax
    PT_ERROR_QUOTE   0x00020000   Failure to Close Quote
    PT_ERROR_UNITS   0x00030000   Inappropriate Units
    PT_ERROR_RANGE   0x00040000   Value Out of Range
    PT_ERROR_SIZE   0x00050000   Value Too Big
    PT_ERROR_KEYWORD   0x00060000   Invalid Keyword
    PT_ERROR_REQUIRED   0x00070000   Value Required
    PT_ERROR_DUPLICATE   0x00080000   Value Duplicated Elsewhere
    PT_ERROR_OVERFLOW   0x00090000   Value Overflows Internal Data
    PT_ERROR_WHOLE_UNITS   0x000A0000   Values May Be Whole Only
    PT_ERROR_UNKNOWN_UNITS   0x000B0000   Unknown Units
    PT_ERROR_CONFLICT   0x000C0000   Conflicting Parameters
    PT_ERROR_CSS_PROPERTY_NAME   0x000D0000   Unknown CSS Property Name
    PT_ERROR_CSS_UNKNOWN_SH_ITEM   0x000E0000   Unknown Item (CSS shorthand)
    PT_ERROR_HEAP_OVERFLOW   0x04000000   Internal Heap Overflow (no offset)
  Warnings        
    PT_WARNING_FRACTIONAL_UNITS   0x01010000   Fractional Units Not Allowed
  Parameter Types      
  SGML        
    PT_INT   0x00000000   Unsigned Integer/Number (e.g., 23.23)
    PT_SIGNED_INT   0x08000000   Signed Integer/Number (+/- e.g., -2.2, +7)
    PT_PERCENT   0x18000000   Percentage (e.g., 43.00%)
    PT_RGB   0x28000000   Color (24-bit RGB | string)
    PT_RGB_MASK   0x00FFFFFF   Mask for Heap or Color
    PT_RGB_HEAP_FLAG   0x02000000   Color Flag, Value on Heap XXXX/ss
    PT_KEYWORD   0x38000000   Keyword Token — The value is the ordinal which is dependent on the attribute or property defined in the DTD for HTML.
    PT_KEYWORD_MASK   0x0000FFFF   Keyword Mask (16-bit)
    PT_CHAR   0x48000000   Character (8-bit ANSI)
    PT_CHAR_MASK   0x000000FF   Character Mask
    PT_BOOL   0x58000000   Boolean (e.g., CHECKED=CHECKED)
  CSS Size as Metric      
    PT_MM   0x10000000   Millimeters (+/- e.g., 12.22mm)
    PT_CM   0x20000000   Centimeters (+/- e.g., 3.12cm)
  CSS Size English      
    PT_IN_100   0x30000000   Inch (100ths) (+/- e.g., 2.50in)
    PT_IN   0x68000000   Inch (10000ths) (+/- e.g., 2.3250in)
  CSS Size Typography      
    PT_PX   0x40000000   Pixel (+/- e.g., 4.84px)
    PT_EM   0x50000000   Em Spaces (+/- e.g., 2.23em)
    PT_EX   0x60000000   Ex Height (+/- e.g., 1.15ex)
    PT_PC   0x70000000   Picas (+/- e.g., 12.50pc)
    PT_PT   0x80000000   Points (+/- e.g., 22.40pt)
  CSS Angle      
    PT_DEG   0x90000000   Degrees (+/- e.g., 4.01deg)
    PT_GRAD   0xA0000000   Gradians (+/- e.g., 21.22grad)
    PT_RAD   0xB0000000   Radians (+/- e.g., 2.77rad)
  CSS Time      
    PT_HZ   0xC0000000   Hertz (+ e.g., 122.12hz)
    PT_KHZ   0xD0000000   Kilohertz (+ e.g., 12.11khz)
    PT_MS   0xE0000000   Milliseconds (+ e.g., 12.11ms)
    PT_S   0xF0000000   Seconds (+ e.g., 4.23s)

Since pvalue formatting uses all bit positions (including 0x80000000), pvalues can easily be confused with formatted error codes. Programmers are cautioned against using IsError, IsNotError and related functions on data declared as a pvalue type.
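The following sketch shows one way to tell the non-value conditions, error pvalues and ordinary values apart using the constants from the table above, and why a high-bit test is unsafe. The `describe` and `high_bit_set` helpers are hypothetical, and treating a set 0x80000000 bit as the mark of a formatted error code is an assumption drawn from the caution above.

```python
PT_MASK = 0xF8000000
PT_HEAP_MASK = 0x0000FFFF
PT_IMPLIED = 0xFFFFFFFF
PT_MIXED = 0xFFFFFFFE
PT_UNTRANSLATED = 0xFFFFFFFD
PT_ERROR = 0xD8000000
PT_ERROR_MASK = 0x07FF0000
PT_ERROR_NO_DETAIL = 0x0000FFFF
PT_ERROR_UNITS = 0x00030000
PT_PT = 0x80000000


def describe(pvalue):
    """Return (kind, simple error code, heap offset of detail string)."""
    # The non-value conditions share their top bits with PT_STRING, so they
    # must be tested as whole values before any type-bit dispatch.
    if pvalue in (PT_IMPLIED, PT_MIXED, PT_UNTRANSLATED):
        return ("non-value", None, None)
    if (pvalue & PT_MASK) == PT_ERROR:
        code = pvalue & PT_ERROR_MASK
        offset = pvalue & PT_HEAP_MASK
        detail = None if offset == PT_ERROR_NO_DETAIL else offset
        return ("error", code, detail)
    return ("value", None, None)


def high_bit_set(pvalue):
    """True when bit 31 is set (assumed marker of a formatted error code)."""
    return bool(pvalue & 0x80000000)


points = PT_PT | 2240  # 22.40pt, a perfectly valid measurement
print(describe(points), high_bit_set(points))
print(describe(PT_ERROR | PT_ERROR_UNITS | PT_ERROR_NO_DETAIL))
```

Here the points value is an ordinary measurement, yet its high bit is set, which is exactly the situation in which an error-code style test would misreport it.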

11.1.7 TOKEN Data Type

As mentioned above, elements, attributes and properties can be referenced by string name or by token value. Tokens are 32-bit dword values with a special type definition of TOKEN. The top bits classify the token types:

  Defined Data   Value   Description
  Token Control        
    TT_TYPE_MASK   0xF0000000   Token Type Mask
    TT_TOKEN_MASK   0x000FFFFF   Token Value Mask
    TT_TOKEN_MASK_16   0x0000FFFF   Token Value Mask (non field)
    TT_USER_FLAG   0x00008000   Token is user-defined
  Fields      
            Note that fields can receive pseudo token status for SGML open/close for stacking and other purposes.
    TT_SGML_FIELD_MASK   0x000F0000   Field Mask
    TT_SGML_FIELD   0x00030000   Field Type/Name
  SGML (HTML/XML)      
    TT_SGML_OPEN   0x10000000   SGML Start Element (e.g., TABLE)
    TT_SGML_CLOSE   0x20000000   SGML End Element (e.g., /TABLE)
    TT_ATTRIBUTE   0x30000000   SGML Attribute
    TT_ENTITY   0x40000000   Entity
    TT_VALUE   0x50000000   Named Entity Values (Properties as defined in a DTD such as NUMBER or %Length)
    TT_NAMESPACE   0x60000000   XML Namespace
    TT_NAMESPACE_DEFAULT   0x60000000   Default Namespace (zero token mask value)
  CSS      
    TT_CSS_PROPERTY   0x70000000   CSS Property (e.g., border or font-size)
    TT_CSS_RULE   0x80000000   CSS Rule (e.g., @import)
  Miscellaneous      
    TT_NULL   0xF0000000   Item is null or empty (attribute, etc)
    TT_ERROR   (TT_NULL + 1)   Error in Item
    TT_UNIVERSAL   (TT_NULL + 2)   Universal (i.e., * specified for a class name)
    TT_UNIVERSAL_IMPLIED   (TT_NULL + 3)   Universal (i.e., not specified, so implied as universal)

 

Like pvalues, tokens can use the high bit and programmers should be careful to avoid using the IsError and IsNotError functions on tokens.

The SDK contains predefined token values for HTML and CSS. Programmers should not hard code tokens since the values may change from version to version of the application.
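The token layout can be sketched by masking off the type bits. The constants below are copied from the table above; the `classify` helper and the ordinal values used in the example are hypothetical, since real token ordinals come from the SDK definitions and should not be hard coded.

```python
TT_TYPE_MASK = 0xF0000000
TT_TOKEN_MASK = 0x000FFFFF
TT_USER_FLAG = 0x00008000
TT_NULL = 0xF0000000

TYPE_NAMES = {
    0x10000000: "sgml open",     # TT_SGML_OPEN
    0x20000000: "sgml close",    # TT_SGML_CLOSE
    0x30000000: "attribute",     # TT_ATTRIBUTE
    0x40000000: "entity",        # TT_ENTITY
    0x50000000: "value",         # TT_VALUE
    0x60000000: "namespace",     # TT_NAMESPACE
    0x70000000: "css property",  # TT_CSS_PROPERTY
    0x80000000: "css rule",      # TT_CSS_RULE
}

SPECIAL = {
    TT_NULL: "null",
    TT_NULL + 1: "error",              # TT_ERROR
    TT_NULL + 2: "universal",          # TT_UNIVERSAL
    TT_NULL + 3: "universal implied",  # TT_UNIVERSAL_IMPLIED
}


def classify(token):
    """Return (type name, token value, user-defined flag) for a token."""
    if token in SPECIAL:
        return SPECIAL[token], 0, False
    kind = TYPE_NAMES.get(token & TT_TYPE_MASK, "unknown")
    return kind, token & TT_TOKEN_MASK, bool(token & TT_USER_FLAG)


# 0x0012 is a made-up ordinal used only for illustration.
print(classify(0x30000000 | 0x0012))
print(classify(TT_NULL + 1))
```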

11.1.8 SGML Classes and Objects

Major object/function groupings:

SGML Object — Low-level parsing, reading and writing.

DTD Object — Low-level functions for managing a Document Type Definition or XML schema.

HTML Table Object — Medium-level HTML table mapping and support.

HTML Header Object — Low-level support for HTML file headers.

HTML Outline Object — High-level document outline.

SGML Code Tools — High-level tools for testing and adjusting generic SGML.

HTML Code Tools — High-level tools for testing and adjusting HTML.

HTML Writer Object — High-level HTML writing.

HTML Page Break — Medium-level functions for creating, reading and managing structured page breaks.

HTML Fields — Low-level functions for managing proprietary HTML fields.

RSS Feed Object — Low-level functions for reading RSS or Atom feed files.

HTML Compare — Low-level functions for comparing HTML documents.

Each of these classes has its own object handle type and is discussed in the following sections.