LDC #44: Mapped Text Objects: Power in Editing Text
Friday, July 21, 2017
Editing text is the foundation of many computer applications, and there are many ways to access and manage textual data and the editing process. As a user changes text, characters and lines of data must be altered, deleted, or inserted, which in turn can require large swaths of data to be shifted to accommodate the changes. As you might imagine, actions like these can become extremely cumbersome and time-consuming. In addition, there are times you might want random access to data within a text file, say by line. If we have a large chunk of text in memory or in a file and we want to go to line 3017, the program would have to count line endings from the start of the data until it reached the 3017th line. Yuck! Let’s see what Legato can do to help manage text as it’s edited.
One method of improving access is to treat a text file like a database. This is exactly what a Mapped Text Object (MTO) does. It maps text lines as records into a database. This provides instant random access. Now we can directly address the 3017th entry in the map and read the line of data. In addition, when we modify a line, the object can simply save the data to a temporary location and then point the map to the modified line. From the caller’s standpoint, there’s no need to know where the data is located. To insert and delete lines, an MTO shifts the map, not the entire file. This is much faster.
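To make the idea concrete, here is a minimal Python sketch of a line map. It illustrates the general technique only and is not the MTO's implementation; the names `build_line_map` and `fetch_line` are hypothetical.

```python
import io

def build_line_map(data: bytes) -> list[tuple[int, int]]:
    """Scan once and record (offset, length) for every line.

    Lengths exclude the trailing return / line feed characters.
    """
    entries = []
    start = 0
    for raw in io.BytesIO(data):          # yields chunks ending in b"\n"
        body = raw.rstrip(b"\r\n")
        entries.append((start, len(body)))
        start += len(raw)
    return entries

def fetch_line(data: bytes, line_map, index: int) -> bytes:
    """Jump straight to a line using the map, with no counting of line endings."""
    offset, length = line_map[index]
    return data[offset:offset + length]

text = b"alpha\r\nbeta\r\ngamma\r\n"
line_map = build_line_map(text)
print(fetch_line(text, line_map, 2))      # b'gamma'
```

The scan happens once; after that, reaching any line costs one table lookup, however large the file is.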
Another benefit of using a map scheme is that the program can carry additional functionality and information. Two beneficial features of the MTO are the capacity to undo and repeat actions and the ability to perform file recovery. An MTO is used in every edit window within GoFiler, even ones based on binary data.
The MTO consists of six major components:
Data Management: This section deals with opening and mapping files or strings. It also manages saving or exporting. Saving causes all changes to be written and the file remapped, while exporting merely writes each line to a specified unrelated file. Saving will also manage backup files depending on the application’s settings.
Segment Processing: Segment processing is the highest level of data access. It allows for textual data to be treated as x/y to x/y segments. Segment access can also operate on a transactional basis, allowing Undo and Redo information to be stored and played back.
Meta Data Management: Meta data, including editing statistics, caller data, and line-by-line flags, can be managed through the MTO. Meta data also includes information such as what windows, if any, are associated with the MTO.
Entry Point Table (EPT): The EPT is essentially the database index of the file on a line-by-line basis. Changes to the data at this level are not tracked. EPT access functions can also manage tab characters, handling native versus realized data.
File Recovery Journal (FRJ): The FRJ is closely related to the Undo operation. As segments are modified, recovery records are added.
Dirty List Management (DLM): The DLM is used to aid in the processing of data changes in the background.
For this article, we will be focusing mostly on Data Management, EPT, and Segment Processing.
EPT Theory of Operation
The heart of the MTO is the Entry Point Table (EPT). Each line in a file will have an entry in the EPT. The caller can access data via segments or directly line-by-line.
As lines are revised, the table is updated to point to a temporary area with the latest data.
During the initial mapping operation, the first two bytes of the file are examined to determine if the content is Unicode. Unicode is byte ordered as “little endian” or “big endian” depending on the source system, meaning that each 16-bit word is stored either least significant byte first or most significant byte first. This is largely a function of the source system’s underlying CPU and operating system. While loading Unicode, each line is checked for characters above 0xFF (more than 8 bits) and a flag is set to indicate the presence of 16-bit data. It is important to note that the current class does not support writing Unicode and Legato does not presently support 16-bit strings.
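The check described above can be sketched in Python. This assumes the two bytes examined are the standard UTF-16 byte order mark (FF FE for little endian, FE FF for big endian); the article does not spell out the exact test, and the function names here are hypothetical.

```python
def detect_byte_order(head: bytes):
    """Return 'little', 'big', or None from the first two bytes (assumed BOM)."""
    if head[:2] == b"\xff\xfe":
        return "little"
    if head[:2] == b"\xfe\xff":
        return "big"
    return None

def has_wide_characters(line: str) -> bool:
    """True if any character needs more than 8 bits (code point above 0xFF)."""
    return any(ord(ch) > 0xFF for ch in line)

# A little-endian sample: BOM followed by text containing a euro sign (U+20AC).
sample = b"\xff\xfe" + "price: \u20ac5".encode("utf-16-le")
order = detect_byte_order(sample)
codec = "utf-16-le" if order == "little" else "utf-16-be"
line = sample[2:].decode(codec)
print(order, has_wide_characters(line))   # little True
```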
If Unicode is not detected, the content is treated as 8-bit, with characters below 0x20 (the space character) treated as control characters. This is normally ANSI, ASCII or some variation such as UTF-8. Regardless of 8 or 16-bit, return (0x0D) and line feed (0x0A) characters are checked and a determination is made on how to process line endings. Again, depending on the source system and software, there are various combinations. The line ending mode is generally detected automatically and compensated for, even with mixed modes. Tab characters (0x09) are also detected so that the caller can determine a processing method. Any zero bytes in the stream will result in the content being marked as binary. The caller can then determine how to process the data.
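As a rough illustration of this kind of scan, the Python sketch below classifies an 8-bit buffer. It assumes that CR LF, a lone CR, and a lone LF each count as one line break; how the MTO actually resolves mixed modes is not described here, and `scan_8bit` is a hypothetical name.

```python
import re

LINE_BREAK = re.compile(rb"\r\n|\r|\n")   # try CR LF first, then lone CR or LF

def scan_8bit(data: bytes) -> dict:
    """Report line endings, tabs, and zero bytes found in an 8-bit buffer."""
    return {
        "binary": b"\x00" in data,                        # any zero byte marks binary
        "has_tabs": b"\x09" in data,
        "endings": sorted(set(LINE_BREAK.findall(data))), # which modes appear
        "lines": LINE_BREAK.split(data),
    }

info = scan_8bit(b"one\r\ntwo\nthree\rfour\tend")
print(info["lines"])     # [b'one', b'two', b'three', b'four\tend']
print(info["endings"])   # [b'\n', b'\r', b'\r\n']
```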
As a compromise to save space in each EPT record, some limits are in place. First, the location address is up to 30 bits in size (the top two bits are used for location control). Therefore, the mapped source and the working data in memory are each limited to 1 GB, or 1,073,741,823 bytes. This in turn means an MTO cannot handle files larger than 1 GB, and because of the restriction on the working data, files between 500 MB and 1 GB may encounter issues in edit tracking: if the file is manipulated enough without saving, the working data limit can be exceeded. This is not generally a problem, and we have not encountered that limit. Second, the width of a line is limited to 65,535 bytes. Larger lines can be mapped and controlled, as discussed later. Again, well-formed text and code should not hit these limits.
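The arithmetic behind these limits can be checked with a short Python sketch. The 30-bit address and 16-bit width come from the text above; the meaning assigned to the two location bits (`LOC_ORIGINAL`, `LOC_WORKING`) and the packing functions are assumptions made purely for illustration.

```python
ADDRESS_BITS = 30
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1   # 0x3FFFFFFF = 1,073,741,823
MAX_LINE_WIDTH = (1 << 16) - 1           # 65,535

# Hypothetical location codes for the top two bits of a 32-bit word.
LOC_ORIGINAL = 0                         # data still lives in the mapped source
LOC_WORKING = 1                          # data was revised and lives in working storage

def pack_location(location: int, address: int) -> int:
    """Combine a 2-bit location code and a 30-bit address into one 32-bit word."""
    if not 0 <= location <= 3:
        raise ValueError("location must fit in two bits")
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address exceeds the 30-bit limit")
    return (location << ADDRESS_BITS) | address

def unpack_location(word: int) -> tuple[int, int]:
    """Split a packed word back into (location, address)."""
    return word >> ADDRESS_BITS, word & ADDRESS_MASK

word = pack_location(LOC_WORKING, 12345)
print(ADDRESS_MASK)              # 1073741823
print(unpack_location(word))     # (1, 12345)
```

Giving up two bits of a 32-bit word for location control is what puts the ceiling at 2^30 - 1 bytes rather than 2^32 - 1.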
Each entry also includes meta data for tracking line level changes, markers or any other data the caller desires. However, the meta data is in binary bits with a limited number of positions for each entry.
The EPT also contains “attribute” data. This information is not presently available to scripts. It is used to represent font, style and other object reference data.
During a save operation the data to which the map points is transferred back to the file or to a new file. That file is mapped and becomes the ‘Original File’.
Supporting the Application Frame
As mentioned above, all edit windows (and some non-edit windows) rely on an MTO as their primary file. MTOs can stand alone, but it is important to understand the larger scheme if a Legato script is to effectively access and manipulate data in the larger sandbox.
All edit windows also rely on an intermediate class known as the “Edit Object.” The MTO is not aware of (and in fact does not care) where your caret is, what is selected on the screen, what your viewport position is, etc. This is where the Edit Object comes in. Edit Objects only exist when a user or some action causes a change or inspection of data. Many MTO functions will work with an MTO handle or an Edit Object handle. Since Edit Objects carry positional information, they are unique to the window and are destroyed when the editing session has been completed. Note that an MDI edit window can have many views; all of them will point to a single MTO shared by the views.
A number of functions exist to help access Edit Objects:
handle = GetActiveEditObject ( );
The GetActiveEditObject function can be used with a hook to get the current or active edit window’s Edit Object. Note that using this function in a script running from the IDE will result in the script window’s Edit Object being returned. A specific window’s edit object can be accessed using:
handle = GetEditObject ( handle hwTarget | int index | string name );
Likewise, a handle to the MTO can be retrieved:
handle = GetMappedTextObject ( handle hwTarget | int index | string name );
Both of the above functions take a window handle, an edit window index number, or a filename as an input parameter. If the functions fail, the returned handle value will be NULL_HANDLE.
File Locking
Since the EPT references a file throughout the document edit life cycle, the source file cannot be changed without being remapped. As a consequence, all files are locked by necessity. Read sharing is still allowed, and read-only files can be opened, although the save operation is then prohibited. Note that the data within the file is not updated until the file is saved.
Opening a File
As shown above, an active MTO handle can be retrieved for an edit window. But what if you want to have your own MTO? Then we open or create one:
handle = OpenMappedTextFile ( string name, [dword mode] );
On success, a handle is returned. The name can be any filename and path, including HTTP references, although HTTP references are treated as read-only. On failure, use the GetLastError function to retrieve a formatted error code. The optional mode parameter specifies how to open the file:
Definition | Bitwise | Description
---|---|---
MFC_OPEN_READ | 0x00000008 | Open as read-only share
MFC_ALLOW_READ_ONLY | 0x00000010 | Open any read-only file
MFC_NO_CACHE | 0x00000080 | Do not read cache (HTTP)
MFC_RECOVERY_TRACKING | 0x00001000 | Use recovery tracking
MFC_UNDO_TRACKING | 0x00002000 | Keep undo information
MFC_ALLOW_TABS | 0x00004000 | Allow and process tabs; by default, tabs are converted and formatted to spaces
MFC_ALLOW_SHARE_TRACKING | 0x00010000 | Allow file share tracking
MFC stands for Mapped File Control. Bits can be logically ORed to combine options.
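Because each option occupies its own bit, combining and testing them is plain bitwise arithmetic. The Python sketch below uses the values from the table above to show the mechanics; in a Legato script the same constants would be ORed together in the mode argument. The `describe_mode` helper is hypothetical.

```python
# Values copied from the mode table above.
MFC_FLAGS = {
    "MFC_OPEN_READ":            0x00000008,
    "MFC_ALLOW_READ_ONLY":      0x00000010,
    "MFC_NO_CACHE":             0x00000080,
    "MFC_RECOVERY_TRACKING":    0x00001000,
    "MFC_UNDO_TRACKING":        0x00002000,
    "MFC_ALLOW_TABS":           0x00004000,
    "MFC_ALLOW_SHARE_TRACKING": 0x00010000,
}

def describe_mode(mode: int) -> list[str]:
    """List the names of the options whose bit is set in a mode value."""
    return [name for name, bit in MFC_FLAGS.items() if mode & bit]

# Read-only share, keep undo information, and preserve tabs.
mode = (MFC_FLAGS["MFC_OPEN_READ"]
        | MFC_FLAGS["MFC_UNDO_TRACKING"]
        | MFC_FLAGS["MFC_ALLOW_TABS"])
print(hex(mode))             # 0x6008
print(describe_mode(mode))   # ['MFC_OPEN_READ', 'MFC_UNDO_TRACKING', 'MFC_ALLOW_TABS']
```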
To create a new file and MTO:
handle = CreateMappedTextFile ( string name );
The name is any valid URN to which the user has create/read/write privileges. Finally, an empty MTO can be created, or an MTO can be created using a string:
handle = CreateMappedTextString ( string data );
When a script is ready to commit the changes, the save function can be called:
int = MappedTextSave ( handle hObject, [string filename], [dword flags] );
The returned int will contain a formatted error code, and ERROR_NONE indicates success. An optional filename can be provided to save to a different location. If the MTO was created from a string, the filename parameter must be supplied. Finally, the optional flags parameter specifies how to manage the save process:
Definition | Bitwise | Description
---|---|---
MSF_BAK_FILE_LIMIT | 0x0000000F | Limit of Journal Files
MSF_BAK_FILE_AS_JOURNAL | 0x00000100 | Perform Backup as Journal Files (name (01).bak)
MSF_BAK_FILE_HIDDEN | 0x00000200 | Make Backup Files Hidden
MSF_NO_BAK_FILE | 0x00000400 | No Backup File
MSF_NO_NEWLINES | 0x00010000 | No 0x0A New Line Codes
MSF_WRITE_ATTRIBUTES | 0x00020000 | Write Attributes to File
MSF_OVERRIDE_READ_ONLY | 0x00040000 | Override Read-Only Setting
MSF_NO_FILE_SAVE_NOTIFY | 0x00100000 | Do Not Notify Application of Change (this only applies to objects associated with one or more windows)
MSF stands for Mapped Save Flags. Bits can be logically ORed to combine options. Please note that if the MTO is associated with a window, the menu function should be used to perform a save operation since the view may also perform various tasks as part of the save process, including reading information from the view to be saved in the file.
To export data while not affecting the MTO state, use the export function:
int = MappedTextExport ( handle hObject, string filename, [dword flags] );
The same parameters apply except the only flag that is active is the MSF_NO_NEWLINES flag.
Another way to retrieve data from an MTO is to request a string:
string = MappedTextToString ( handle hObject, [boolean newline] );
This function will return a string containing the contents of the MTO. By default, lines are separated by single return characters (0x0D).
Before we cover routines to alter the contents of an MTO, it is important to note that there are two ways to modify data within an MTO: non-transaction-based line-level access and transaction-based segment-level access. All MTOs supporting windows will be transaction based.
Non-Transaction Line Access
A number of functions are provided to perform basic line level tasks like read, replace, insert, and delete. In many cases, even if using segments (discussed below), reading lines can be desirable. All line functions expect that the line of text does not contain return or new line codes.
To begin with, it might be good to know how many lines are in a file or MTO.
int = GetLineCount ( handle hObject );
The GetLineCount function will return the total lines within an MTO or Edit Object. To directly read a line, the ReadLine function is used:
string = ReadLine ( handle hObject, [int index | position] );
This function returns a string, which can be empty if the line has no data or if the function encountered an error. Use the IsError or GetLastError functions to check for an error or get an error code. The first parameter, hObject, references the object from which the line will be read. For this discussion, the parameter would contain a handle to an MTO or an Edit Object. When using the ReadLine function with an MTO, the zero-based index parameter is required. The MTO will automatically retrieve the data from the file or temporary area depending on whether or not the data was previously modified. If the source was Unicode, it will be converted to 8-bit ANSI. Note that Legato does not presently support 16-bit wide character processing for Unicode. Also note that the ReadLine function works with more object types than just the MTO, such as a string pool.
The flip side of the ReadLine function is the ReplaceLine function. It allows data to be written back to the MTO. In the prototypes below, hObject is refined as hMappedText.
int = ReplaceLine ( handle hMappedText, int index, string data );
The function returns a formatted error code on failure, but if used properly, it returns ERROR_NONE. The index parameter must specify an existing line. To add data, the InsertLine function must be used:
int = InsertLine ( handle hMappedText, int index, string text );
Again, the function returns a standard result. The InsertLine function inserts the new line prior to the specified index parameter. The index can be specified as -1 to append to the end of the map. This relieves the script from having to get the last index position and calculate the write position to append data.
Finally, lines can be deleted:
int = DeleteLine ( handle hMappedText, int index, [int count] );
If the count parameter is not provided, then a single line is deleted.
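The behavior of these line functions can be modeled in a few lines of Python. This is a toy model of the concept described in this article (revised lines go to a working area and only the map changes), not the MTO's actual data structures; the class and method names are hypothetical and merely echo the Legato functions above.

```python
class LineMap:
    """Toy line map: entries point at text in one of two stores instead of holding it."""

    def __init__(self, text: str):
        self.original = text.split("\n")     # stands in for the mapped source file
        self.working = []                    # stands in for the temporary area
        self.map = [("orig", i) for i in range(len(self.original))]

    def _stash(self, data: str):
        """Store new text in the working area and return a pointer to it."""
        self.working.append(data)
        return ("work", len(self.working) - 1)

    def line_count(self) -> int:
        return len(self.map)

    def read_line(self, index: int) -> str:
        store, slot = self.map[index]
        return self.original[slot] if store == "orig" else self.working[slot]

    def replace_line(self, index: int, data: str) -> None:
        if not 0 <= index < len(self.map):
            raise IndexError("index must specify an existing line")
        self.map[index] = self._stash(data)  # repoint; the source is untouched

    def insert_line(self, index: int, data: str) -> None:
        if index == -1:                      # -1 appends to the end of the map
            index = len(self.map)
        self.map.insert(index, self._stash(data))

    def delete_line(self, index: int, count: int = 1) -> None:
        del self.map[index:index + count]    # only map entries are shifted

m = LineMap("one\ntwo\nthree")
m.replace_line(1, "TWO")
m.insert_line(0, "zero")
m.insert_line(-1, "four")
m.delete_line(1)                             # removes "one"
print([m.read_line(i) for i in range(m.line_count())])
# ['zero', 'TWO', 'three', 'four']
```

Note that `m.original` never changes: every edit either adds text to the working store or rearranges the map.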
Transaction Based Segment Access
Rather than accessing data by lines, higher level functions are provided to access and alter data by segments or regions. In these cases, there’s no need to specify a line. Regions are defined as x/y-x/y segments. This method supports undo operations and file recovery. Segment X positions are always native, that is, tab characters are counted as one position. There are functions to aid in handling translation of tab positions.
To read a segment:
string = ReadSegment ( handle hObject, [int sx, int sy, int ex, int ey] );
The function returns a string, which can be empty on failure. Use the GetLastError function to check for errors. Line breaks are represented by returns only (0x0D). The segment parameters are optional only when using an Edit Object since an Edit Object can contain a selected area. If omitted in that case, whatever is selected is returned.
To replace a segment:
int = WriteSegment ( handle hObject, string data, int sx, int sy, [int ex, int ey] );
int = WriteSegment ( handle hObject, string data );
These functions have essentially three flavors: (i) for an Edit Object, the position can be omitted to replace the selected area or insert at the caret; (ii) for both object types, the sx and sy parameters can be specified to insert at that position; and finally, (iii) a complete segment can be specified.
To insert lines, either include additional line endings in the data to be written or omit the ending position of the segment.
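Segment addressing is easier to picture with a small model. The Python sketch below operates on a plain list of lines and assumes zero-based positions with an exclusive end column; those conventions, and the function names, are assumptions for illustration rather than a statement of how ReadSegment and WriteSegment are implemented. Line breaks in segment text are returns (0x0D), as described above.

```python
def read_segment(lines, sx, sy, ex, ey):
    """Return the text from column sx on line sy up to column ex on line ey."""
    if sy == ey:
        return lines[sy][sx:ex]
    parts = [lines[sy][sx:]] + lines[sy + 1:ey] + [lines[ey][:ex]]
    return "\r".join(parts)

def write_segment(lines, data, sx, sy, ex=None, ey=None):
    """Replace a segment in place; with no end point, insert at sx/sy.

    Returns the text that was replaced, which is what an undo record
    would need to keep in order to reverse the change.
    """
    if ex is None or ey is None:
        ex, ey = sx, sy                       # empty segment: pure insertion
    old = read_segment(lines, sx, sy, ex, ey)
    head, tail = lines[sy][:sx], lines[ey][ex:]
    lines[sy:ey + 1] = (head + data + tail).split("\r")
    return old

doc = ["Hello world", "second line", "third"]
print(repr(read_segment(doc, 6, 0, 6, 1)))    # 'world\rsecond'
old = write_segment(doc, "there\rnew", 6, 0, 6, 1)
write_segment(doc, "X\r", 0, 2)               # no end point: inserts a line
print(doc)   # ['Hello there', 'new line', 'X', 'third']
```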
Other Things to Note
Some other line based functions are as follows:
int = GetLineSize ( handle hObject, int index, [boolean realized] );
boolean = IsBlankLine ( handle hObject, int index );
int = MoveToNonBlankLine ( handle hMappedText, int index, [boolean backward] );
int = NativeToRealized ( string reference | [handle hMappedText, int line], int position );
int = RealizedToNative ( string reference | [handle hMappedText, int line], int position );
What the GetLineSize function does is likely obvious except for the realized flag. When set to TRUE, tabs are expanded to the default and then the size of the line is returned. The IsBlankLine and MoveToNonBlankLine functions help manage empty lines (which includes lines with only white space). Finally, the NativeToRealized and RealizedToNative functions calculate positions based on tabs within text.
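The difference between native and realized positions is easiest to see in code. The Python sketch below assumes tabs advance to the next multiple of a fixed tab width (4 here); the actual default tab setting and rounding rules used by Legato are not stated in this article, so treat both functions as hypothetical illustrations.

```python
def native_to_realized(line: str, position: int, tab_width: int = 4) -> int:
    """Column reached after expanding the tabs that precede a native position."""
    col = 0
    for ch in line[:position]:
        if ch == "\t":
            col += tab_width - (col % tab_width)   # jump to the next tab stop
        else:
            col += 1
    return col

def realized_to_native(line: str, position: int, tab_width: int = 4) -> int:
    """Index of the character that occupies a realized (expanded) column."""
    col = 0
    for index, ch in enumerate(line):
        width = tab_width - (col % tab_width) if ch == "\t" else 1
        if col + width > position:
            return index
        col += width
    return len(line)

line = "a\tbc"                               # native positions: a=0, tab=1, b=2, c=3
print(native_to_realized(line, 2))           # 4  ('b' lands on column 4)
print(realized_to_native(line, 4))           # 2  (column 4 is 'b')
print(realized_to_native(line, 2))           # 1  (column 2 falls inside the tab)
print(native_to_realized(line, len(line)))   # 6  (realized size of the line)
```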
As mentioned at the start of the article, MTOs support various kinds of meta data. Named text fields can be added or read using the SetObjectMetaData and GetObjectMetaData functions. While the utility of this type of meta data may seem limited when using an MTO to access a file, it can be invaluable when working with edit windows. Setting meta data allows scripts to be aware of other script actions or to carry information asynchronously.
Other meta data:
dword = GetMappedTextEncoding ( handle hMappedText );
string = GetMappedTextEncodingString ( handle hMappedText );
dword = GetMappedTextFileType ( handle hMappedText );
string = GetMappedTextFilename ( handle hMappedText );
As a file is mapped, the GetMappedTextEncoding and GetMappedTextEncodingString functions can be used to determine if the source is Unicode and what type. The file type and name can also be retrieved.
Conclusion
Having a good working knowledge of MTOs is essential to working with edit windows and SGML/HTML objects. One item touched upon above was long or overflowed lines. If a line exceeds the 65,535-byte limit and the MTO is being used with an SGML Object, the parser will automatically compensate by creating an artificial break in the long line, which is transparent to an SGML reader. Tricks like these can make parsing and editing text with MTOs in Legato easy and powerful.
Scott Theis is the President of Novaworks and the principal developer of the Legato scripting language. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages.