There are many occasions when a programmer needs to read from or write to a text file (or an HTML/XML file). While sometimes you may need to complete just a minor task like accessing a file name, other times you may have to hold and edit the file contents. Legato offers a spread of low- to high-level functions for file access. This article explores various methods of manipulating file data. In subsequent articles, we will explore each in detail.
Friday, May 19. 2017
LDC #35: Accessing Text Files
Introduction
In many cases when it comes to file use in GoFiler, read and write operations involve textual data or some encoded version of textual data. For encoding text and binary data, there is a suite of Legato functions that offer support. However, in this article we will be focusing on text-based file operations.
Depending on the level of the operation, specifics regarding text can be important. For example, the type of line endings or the maximum length of a line of text can make a big difference when reading and writing file contents. Other considerations include the file encoding, file locking, and sharing. The following table lists the types of operations and objects we will be discussing, in approximate order of sophistication:
Type | Random Access | File Lock & Share | Encoding | Description |
Basic File Object | Yes | Yes | Any | Flat random access file. |
Files/String | No | No | 8-bit | Routines to read and write data directly to and from string variables. |
Pool Object | No | No | 8-bit | An object that allows large pools of string data. |
CSV Object | No | No | 8-bit | An object that provides the ability to read and write comma delimited data. |
Data Sheet Object | Yes | No | 8-bit | An object that provides spreadsheet style data support. |
Settings Functions | Yes | No | 8-bit | Functions to read and write data in setting or INI format. |
SGML Object | Yes | Yes | 8-bit | An object specializing in reading and writing HTML and XML data. |
Mapped Text Object | Yes | Yes | 8/16-bit | A text centric object that is designed for editor support. |
Edit Object | Yes | Yes | 8/16-bit | A text centric object that is designed for editor support. |
This is just a subset of file objects and operations. There are many other objects that read files ranging from an RSS feed to a compressed or zipped file. Let’s take a closer look at some of the terminology in the above table.
‘Random Access’ is the ability of the program to seek out a specific area of the file at random and in any order. Otherwise, data is considered serial and is read from the first chunk of data to the last. Once a file has been read into a string or array (cluster), the data can be randomly accessed. Random access can mean accessing by byte offset, by line and character position, or via x and y coordinates in the form of rows and columns.
‘File Lock and Share’ refers to locking a file during all operations such that other users, processes, or programs cannot read or alter the file’s content unless specific sharing settings are enabled. For objects and functions that load the content or write the content in one fell swoop, there is the lowest level of read sharing on load and an exclusive lock on write. Other objects can open the file exclusively or allow other users to read or write the file.
‘Encoding’ describes how the file is encoded on a fundamental level. There are essentially two types of encoding: binary and text. On a binary level, text-based files are either encoded as 8-bit bytes with an ASCII base or as 16-bit words in Unicode. Since all data is fundamentally stored on devices as bytes, or 8-bit chunks, attempting to read a Unicode file as ASCII will lead to disaster. For western languages, every other byte in the bulk of a Unicode file will be zero. For example, in ASCII, ‘ABC’ in hex is 0x41 0x42 0x43 as 8-bit values. In Unicode, the same text is 0x0041 0x0042 0x0043, but in the raw file it is stored as 0x41 0x00 0x42 0x00 0x43 0x00. For a reader expecting 8-bit characters, this is going to be a problem. By convention, the first two bytes of a Unicode file contain a marker not likely to be found in plain text. Unicode readers can use this marker to switch from reading 8 bits to reading 16 bits by combining two bytes. Larger 16-bit Unicode values can be represented in 8 bits by using an escape method and an encoding format called Unicode Transformation Format, or UTF. At a higher level, the text might be further encoded as base64, UUencode, or any number of other formats. They all rely on underlying 7/8-bit ASCII, however. There are other encoding methods, but most of them have slowly faded into antiquity.
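To make the byte layout concrete, here is a minimal sketch in Python (used only as a neutral illustration; it is not Legato code). It prints the bytes of ‘ABC’ in both encodings and checks for the leading marker. The helper names hex_bytes and detect_encoding are hypothetical and exist only for this example.

```python
import codecs

text = "ABC"

ascii_bytes = text.encode("ascii")                   # one byte per character
utf16_bytes = text.encode("utf-16-le")               # two bytes per character, low byte first
utf16_with_bom = codecs.BOM_UTF16_LE + utf16_bytes   # marker followed by the text

def hex_bytes(data: bytes) -> str:
    """Format raw bytes the way the article writes them, e.g. 0x41 0x00."""
    return " ".join(f"0x{b:02X}" for b in data)

def detect_encoding(raw: bytes) -> str:
    """Guess the encoding from the leading marker, if there is one."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    return "8-bit"

print(hex_bytes(ascii_bytes))           # 0x41 0x42 0x43
print(hex_bytes(utf16_bytes))           # 0x41 0x00 0x42 0x00 0x43 0x00
print(detect_encoding(utf16_with_bom))  # utf-16-le
```

A reader that skips the marker check and treats the second byte sequence as 8-bit text would see a zero byte after every character, which is the problem described above.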
Next, we will make a cursory examination of some of the items in the above table.
Basic File Object
The Basic File Object (BFO) is the most fundamental method of accessing and writing file data. Most programming languages have basic create, open, read, write, and close file operations. In addition, there are functions to delete, rename, and enumerate files that are beyond the scope and subject of this article. Further, as you read through this blog, you will find that the BFO is the least desirable for working with simple text. However, it is good to have a working knowledge of the functions as they may come in handy.
To open a file the OpenFile function is used:
handle = OpenFile ( string name, [dword mode] );
The function returns a handle, a special variable used for subsequent calls to access the file. When processing of the file is completed, the universal CloseHandle function will release and close the file. If the OpenFile function fails, the returned handle will be zero (NULL_HANDLE). The IsError function can check the handle and the GetLastError function will return a formatted error code. As with all functions that return handles, there is no need to ‘close’ a failed handle.
The file can be located locally, on a network, or even on a web server. The name parameter specifies the location and name of the file. If the file is referenced by a URI, it must be opened for read-only access.
The mode is a series of bitwise values that tell the function the desired access. We’ll explore these bits in more detail in another article in the series.
In addition to opening a file, a new file can be created:
handle = CreateFile ( string name, [dword mode] );
Unlike the OpenFile function, the CreateFile function can only create files on the local computer or on network locations for which the script has appropriate rights. The name and mode parameters operate in the same manner (except creating a read-only file is a bit counter-intuitive).
Data can be read and written to basic files by text lines or in blocks.
To read the content of a file, the ReadBlock function will retrieve a chunk of a file at the current file position. Reading blocks is designed more for random access of binary data, but it can be used to read anything.
int = ReadBlock ( handle hBasicFile, parameter *data, [int bytes] );
Generally, the ReadBlock function is used to read data into a character buffer. If a string is used, it must be a character array.
The return value is either the actual number of bytes read or zero on failure. The bytes parameter is the requested amount of data to read. If omitted, the size of the data parameter is used. Here is an example of serially reading buffers of data:
handle hFO;
char cb[64];
int fx, rc;

hFO = OpenFile("http://www.sec.gov/Archives/edgar/data/51143/000095012394002047/0000950123-94-002047.txt");
if (hFO == NULL_HANDLE) {
    MessageBox('x', "Could not open file.");
    exit;
}

fx = GetFilePosition(hFO);
rc = ReadBlock(hFO, cb);
while (rc != 0) {
    if (rc != 64) { cb[rc] = 0; }
    AddMessage("%08X - %s", fx, cb);
    fx = GetFilePosition(hFO);
    rc = ReadBlock(hFO, cb);
}
In the above code, we are opening a random file from the SEC. The program then gets the current file pointer position and reads a block into the character buffer cb, and the size of the read is automatically set by the size of cb. We continue in the loop until rc is zero, indicating all the data in the file has been exhausted. The buffer cb and position are added to the default log. (Note that the log does not display the line returns in the text. To convert them, use the ConvertNoCodes function.)
A little bit of code is added prior to writing to the log to check to see if rc is not the same size as the buffer (presumably less), and in such case, the script terminates the last location with a zero. If we do not do this, the last buffer displayed will contain information from the previous read since the ReadBlock function only reads what is specified or available on a binary basis and does not clear the remaining area. As a side note, when cb is referenced as a string, Legato will automatically add the terminating zero to the parameter to stop the string read from progressing outside the scope of the char buffer. Here is the example output to the default log:
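The stale-data effect is not specific to Legato; it appears whenever a fixed-size buffer is reused for block reads. The following is a rough Python sketch of the same situation (an illustration only, not Legato code), using an in-memory stream as a stand-in for the file and an 8-byte buffer in place of the 64-byte one.

```python
import io

source = io.BytesIO(b"ABCDEFGHIJKL")   # 12 bytes: one full block plus a short one
buf = bytearray(8)                     # fixed-size buffer, reused for every read

chunks_unterminated = []
chunks_terminated = []

while True:
    n = source.readinto(buf)           # overwrites only the first n bytes of buf
    if n == 0:
        break
    # Naive view: treat the whole buffer as the text that was just read.
    chunks_unterminated.append(bytes(buf).decode("ascii"))
    # Corrected view: stop at the number of bytes actually read.
    chunks_terminated.append(bytes(buf[:n]).decode("ascii"))

print(chunks_unterminated)   # ['ABCDEFGH', 'IJKLEFGH']
print(chunks_terminated)     # ['ABCDEFGH', 'IJKL']
```

The second naive chunk still carries "EFGH" from the previous read, which is the leftover data that terminating the buffer at rc avoids.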
00000000 - -----BEGIN PRIVACY-ENHANCED MESSAGE-----.Proc-Type: 2001,MIC-CLE
00000040 - AR.Originator-Name: keymaster@town.hall.org.Originator-Key-Asymm
00000080 - etric:. MFkwCgYEVQgBAQICAgADSwAwSAJBALeWW4xDV4i7+b6+UyPn5RtObb1c
000000C0 - J7VkACDq. pKb9/DClgTKIm08lCfoilvi9Wl4SODbR1+1waHhiGmeZO8OdgLUCAw
. . .
00004580 - ed December 31, 1993 have been so incorporated in reliance on th
000045C0 - e report of.Price Waterhouse LLP, independent accountants, given
00004600 - on the authority of said.firm as experts in auditing and accoun
00004640 - ting.. . 4.</TEXT>.</DOCU
00004680 - MENT>.</IMS-DOCUMENT>.-----END PRIVACY-ENHANCED MESSAGE-----.
The above example is clumsy for reading text unless one is trying to retrieve a segment of information directly from a file. When combined with the SetFilePosition function, the ReadBlock function can easily pluck text from a specific location as needed.
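The position-then-read pattern can be sketched in Python as follows (an illustration of the general idea, not Legato code). The helper read_at is hypothetical, and the in-memory sample stands in for the opened file; it assumes single-byte line endings, which is what the logged offsets in this article suggest.

```python
import io

# Stand-in for the opened file: the first two lines of the sample document.
sample = (b"-----BEGIN PRIVACY-ENHANCED MESSAGE-----\n"
          b"Proc-Type: 2001,MIC-CLEAR\n")

def read_at(stream, offset, size):
    """Move to a byte offset, then read a fixed-size block from there."""
    stream.seek(offset)
    return stream.read(size)

stream = io.BytesIO(sample)

# Offset 0x29 is where the second line starts; it is 25 bytes long.
second_line = read_at(stream, 0x29, 25)
print(second_line)   # b'Proc-Type: 2001,MIC-CLEAR'
```

Because the read is addressed by offset, the blocks can be fetched in any order, which is the random access property listed in the table at the start of the article.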
Another direct file read option is the ReadLine function:
string = ReadLine ( handle hObject, [int position] );
The ReadLine function will read from the specified or current position and return a line of text, less the line ending characters. You’ll notice that the object is specified as generic hObject. This is because the ReadLine function actually works with a number of object types. Here we will just focus on the Basic File Object for which the position parameter is a file position. A modification to the ReadBlock example above could be as follows:
handle hFO;
string s1;
int fx;

hFO = OpenFile("http://www.sec.gov/Archives/edgar/data/51143/000095012394002047/0000950123-94-002047.txt");
if (hFO == NULL_HANDLE) {
    MessageBox('x', "Could not open file.");
    exit;
}

fx = GetFilePosition(hFO);
s1 = ReadLine(hFO);
while (IsNotError(s1)) {
    AddMessage("%08X - %s", fx, s1);
    fx = GetFilePosition(hFO);
    s1 = ReadLine(hFO);
}
Our sample output:
00000000 - -----BEGIN PRIVACY-ENHANCED MESSAGE-----
00000029 - Proc-Type: 2001,MIC-CLEAR
00000043 - Originator-Name: keymaster@town.hall.org
0000006C - Originator-Key-Asymmetric:
00000087 - MFkwCgYEVQgBAQICAgADSwAwSAJBALeWW4xDV4i7+b6+UyPn5RtObb1cJ7VkACDq
000000C9 - pKb9/DClgTKIm08lCfoilvi9Wl4SODbR1+1waHhiGmeZO8OdgLUCAwEAAQ==
00000107 - MIC-Info: RSA-MD5,RSA,
. . .
000045CC - Price Waterhouse LLP, independent accountants, given on the authority of said
0000461A - firm as experts in auditing and accounting.
00004646 -
00004648 - 4
00004672 - </TEXT>
0000467A - </DOCUMENT>
00004686 - </IMS-DOCUMENT>
00004696 - -----END PRIVACY-ENHANCED MESSAGE-----
Notice that the data is now in the log line by line. The line endings can be 0x0D, 0x0A or 0x0D/0x0A combinations. The maximum size of the resulting string is 65,535 bytes. Reading from a Basic File Object by lines is inefficient. Avoid repeatedly reading or scanning large files using this method.
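As a language-neutral illustration of handling all three line ending styles, here is a short Python sketch (not Legato code, and not a description of how ReadLine is implemented). The name split_lines is hypothetical. The pattern tries the two-byte 0x0D/0x0A pair first so that it counts as a single line ending.

```python
import re

# Try the two-character ending first so "\r\n" is not split into two endings.
LINE_ENDING = re.compile(r"\r\n|\r|\n")

def split_lines(text: str) -> list:
    """Break text into lines, dropping the line ending characters."""
    return LINE_ENDING.split(text)

sample = "first\r\nsecond\nthird\rfourth"
print(split_lines(sample))   # ['first', 'second', 'third', 'fourth']
```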
Writing can also be in blocks or lines and can also occur in random access fashion. The WriteBlock function looks like this:
int = WriteBlock ( handle hBasicFile, parameter name, [int size] );
The parameter size is generally a buffer size and it can be omitted if the entire buffer is to be written. An example to illustrate the function:
handle hFO;
string s1;

hFO = CreateFile(GetDesktopFolder() + "Write Block Test.txt");
if (hFO == NULL_HANDLE) {
    MessageBox('x', "Could not create file.");
    exit;
}

s1 = FileToString("http://www.sec.gov/Archives/edgar/data/51143/000095012394002047/0000950123-94-002047.txt");
WriteBlock(hFO, s1, GetStringLength(s1));
Most of the above code could be replaced with the HTTPGetFile function, which automatically retrieves data from a specified URI and places it in a file, but the point is to show how the WriteBlock function works. In the example, a file called “Write Block Test.txt” is created in the user desktop area. The entire content at our URL is loaded into a string, s1, and then written to the file. Notice that the size of the buffer is specified using the GetStringLength function. If we allow the default buffer size, the WriteBlock function will write out the terminating zero at the end of the string. It writes whatever is in the buffer to the current file position exactly, without intervention. If the file is not large enough, it automatically expands the file’s size.
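The difference between writing the whole buffer and writing only the string length can be seen in this small Python sketch (an analogy using a C-style buffer, not Legato code). The in-memory streams stand in for the output file.

```python
import ctypes
import io

text = b"Hello"
buf = ctypes.create_string_buffer(text)   # C-style buffer: 5 characters + terminating zero

whole = io.BytesIO()
whole.write(buf.raw)                      # entire buffer, zero byte included

exact = io.BytesIO()
exact.write(buf.raw[:len(text)])          # explicit size: only the characters

print(whole.getvalue())   # b'Hello\x00'
print(exact.getvalue())   # b'Hello'
```

The extra zero byte is harmless to some readers but will confuse others, which is why the example passes the string length explicitly.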
Lines can also be written using the WriteLine function:
int = WriteLine ( handle hBasicFile, string line, [parameters ... ] );
One thing that is interesting about this function is that, like the AddMessage or FormatString functions, parameters can be added to create a formatted string. If parameters are not added, the line is treated as a simple text line and written with a 0x0D/0x0A appended. Here’s a variation on the above write example:
handle hFO;
string s1;
string lines[];
int ix, size;

hFO = CreateFile(GetDesktopFolder() + "Write Line Test.txt");
if (hFO == NULL_HANDLE) {
    MessageBox('x', "Could not create file.");
    exit;
}

s1 = FileToString("http://www.sec.gov/Archives/edgar/data/51143/000095012394002047/0000950123-94-002047.txt");
lines = ExplodeString(s1);
size = ArrayGetAxisDepth(lines);
while (ix < size) {
    WriteLine(hFO, lines[ix]);
    ix++;
}
Again, for illustration, we load our random file to a string, explode it to an array and then write each line to the file. Since the WriteLine function appends a return and line feed, the resulting file is one line longer than the write block example.
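One way to picture where the extra line comes from is the Python sketch below (an illustration only, not Legato code). It assumes that splitting text which ends with a line ending leaves a trailing empty entry; whether ExplodeString behaves exactly this way is an assumption, not something confirmed here.

```python
original = "line one\r\nline two\r\n"     # source text ends with a line ending

# Assumption: splitting on the line ending leaves a trailing empty entry.
lines = original.split("\r\n")

# Write every entry followed by a return and line feed, as a line writer would.
rewritten = "".join(line + "\r\n" for line in lines)

print(lines)                            # ['line one', 'line two', '']
print(len(rewritten) - len(original))   # 2
```

The two extra bytes are the return and line feed written after the final, empty entry.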
The line versions of read and write are not the right choice when efficiency is of paramount concern. Internally, these functions do not perform buffering or caching. In today’s fast computer environment, the cost is hardly noticeable for small files. However, start parsing or writing a couple hundred megabytes and you will feel the pinch.
Files to and from Strings
Textual file content can be easily interchanged with strings using the FileToString and StringToFile functions. Their operation is fairly obvious except it is important to note that a zero byte anywhere in the data will be considered the end of the data. Otherwise, the data is read and written verbatim.
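The effect of a zero byte on string-based handling can be shown with a short Python sketch (an illustration of the general C-string convention, not Legato code). The helper as_c_string is hypothetical.

```python
def as_c_string(raw: bytes) -> bytes:
    """Return the data up to, but not including, the first zero byte."""
    end = raw.find(b"\x00")
    return raw if end < 0 else raw[:end]

print(as_c_string(b"plain text"))        # b'plain text'
print(as_c_string(b"header\x00hidden"))  # b'header'
```

Anything after the first zero byte is simply not seen, so files that may contain binary data are better handled with block reads.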
string = FileToString ( string name );
Again, the name parameter can be any qualified filename or even an HTTP URL. The file is opened with the most flexible sharing modes to load the data. Once loaded, the file is closed. On error, an empty string is returned. Therefore, since a file can be empty, to detect an error, the IsError, IsNotError or GetLastError functions must be used. The resulting string can be very large depending on the file, so performance will rely on system memory availability. The ExplodeString function can be used subsequently to pull the string into array entries.
To go the other way with data:
int = StringToFile ( string data, string name );
For this function, the return value can directly contain an error result. The data parameter specifies the string to write, while the name parameter is a qualified filename. The process must have rights to write to the specified volume. If there is an existing file by the same name, it is overwritten without warning or confirmation. The function expects write exclusive sharing.
These functions can be very useful for quickly reading or writing data in an environment where file lock management is not important or desired. For more examples and discussion, check out our LDC blog #13 Explode Strings and Files.
Files to and from Pools
Another method of accessing text to and from files is to use a Pool. This object provides for loading and managing strings in the form of a large memory block. The block can be a contiguous string or many separate strings. For the purposes of this section, we will treat a pool as a contiguous block of string data.
To perform any pool operation, we must first create a Pool Object:
handle = PoolCreate ( [string data] );
A handle value is returned to reference the object in later calls. Optionally, string data can be loaded into the pool on creation. Since we are loading from a file, we don’t need this parameter. Once the pool has been created, we can append the contents of a file:
int = PoolAppendFile ( handle hPool, string name );
The pool handle is specified along with a qualified file name. The file can be located locally or can be a URL. The file is opened with the most compliant level of sharing and the entire contents appended to the pool. The return value is a formatted error code, normally ERROR_NONE.
The ReadLine function can read from a number of object types including pools. For pools, it serially reads data one line at a time breaking at conventional line endings:
string = ReadLine ( handle hPool, [int position] );
A string is returned, which may be empty on a blank line or because of an error. Again, the IsError and GetLastError functions should be used to determine whether an error occurred and the nature of the error. Unlike many functions, on success the last error code from the ReadLine function will contain additional information including how that line was ended. See the Legato SDK for additional information on the last error code.
This example shows the use of a String Pool Object:
handle hPool;
string s1;
int rc, fx;

hPool = PoolCreate();
rc = PoolAppendFile(hPool, "http://www.sec.gov/Archives/edgar/data/51143/000095012394002047/0000950123-94-002047.txt");
if (IsError(rc)) {
    MessageBox('x', "Could not read file.");
    exit;
}

fx = PoolGetReadPosition(hPool);
s1 = ReadLine(hPool);
while (IsNotError(s1)) {
    AddMessage("%08X - %s", fx, s1);
    fx = PoolGetReadPosition(hPool);
    s1 = ReadLine(hPool);
}
The output is identical to the Basic File Object example.
After creating a pool and loading it with string materials, you can then write the pool to a file with this function:
int = PoolWriteFile ( handle hPool, string name );
The name parameter references a file that must be in a local or network location for which the script has appropriate rights. The function returns a conventional formatted error code.
Mapped Text Object
The final and perhaps most important method of accessing text files in Legato is the Mapped Text Object. The MTO is a flexible and powerful tool to access text based files for editing. One of the biggest features the MTO offers is transaction control (i.e., the ability to allow the user to undo and redo changes). The MTO is the foundation object for most edit windows and will be covered in detail in a later blog article. In this section, we will introduce the MTO and cover the basics.
An existing Mapped Text Object handle can be obtained from higher level objects, and most Legato SDK functions work equally well between MTOs and Edit Objects. To work with an MTO independent of a higher-level object, the OpenMappedTextFile function allows for a file to be opened for editing and locked.
handle = OpenMappedTextFile ( string name, [dword mode] );
The function returns a handle after successfully opening and mapping a file specified by the name parameter. If a URL is referenced, the file must be opened with read-only flags for the optional mode parameter.
Here’s the above example implemented this time with a Mapped Text Object:
handle hMap;
string s1;
int ix, lines;

hMap = OpenMappedTextFile("http://www.sec.gov/Archives/edgar/data/51143/000095012394002047/0000950123-94-002047.txt", MFC_OPEN_READ);
if (hMap == NULL_HANDLE) {
    MessageBox('x', "Could not open file.");
    exit;
}

lines = GetLineCount(hMap);
while (ix < lines) {
    s1 = ReadLine(hMap, ix);
    AddMessage("%4d - %s", ix, s1);
    ix++;
}
You will notice in this example that we can get the number of lines in the file using the GetLineCount function. Since the object is an Edit Object and it is referenced by x/y position, we do not generally care about absolute file position. The loop therefore moves through the file in lines and the current line count is specified as the second parameter to the ReadLine function.
Our output now looks a little different:
   0 - -----BEGIN PRIVACY-ENHANCED MESSAGE-----
   1 - Proc-Type: 2001,MIC-CLEAR
   2 - Originator-Name: keymaster@town.hall.org
   3 - Originator-Key-Asymmetric:
   4 - MFkwCgYEVQgBAQICAgADSwAwSAJBALeWW4xDV4i7+b6+UyPn5RtObb1cJ7VkACDq
   5 - pKb9/DClgTKIm08lCfoilvi9Wl4SODbR1+1waHhiGmeZO8OdgLUCAwEAAQ==
   6 - MIC-Info: RSA-MD5,RSA,
. . .
 324 - Price Waterhouse LLP, independent accountants, given on the authority of said
 325 - firm as experts in auditing and accounting.
 326 -
 327 - 4
 328 - </TEXT>
 329 - </DOCUMENT>
 330 - </IMS-DOCUMENT>
 331 - -----END PRIVACY-ENHANCED MESSAGE-----
 332 -
Analogous functions for writing include the WriteLine and ReplaceLine functions. However, these should only be used for objects that are not under transactional edit control for supporting undo and redo edit operations. A higher level of access for reading and writing is provided by the ReadSegment and WriteSegment functions, respectively. The segment functions work by accessing or replacing a region of a file specified by character and line positions or an x/y position. Further, the WriteSegment function automatically spools data for transaction tracking.
Changes to the source file which the MTO is referencing are not actually applied until the file is saved or exported using the MappedTextSave or MappedTextExport functions. The act of saving manages change transaction counting and remaps the file, while exporting simply writes data to a specified file without altering the state of the object. A file, such as the URL specified above, can be saved or exported so long as the destination is a different name located locally or on a network volume to which the script has write access rights.
As discussed in other blogs, many times files contain or scripts work with tagged data. Handling tagged data is generally best performed by the SGML Object, which actually uses an MTO as its basis.
As mentioned above, Mapped Text Object will be covered in more detail in a later blog. For now, realize that it is a powerful tool that allows for random accessing and editing of very large files.
Conclusion
This blog, while lengthy, gives an overview of some of the methods of accessing and writing file data. Look for future articles to cover some of the above in more detail. The method you choose should be based on what your script needs to accomplish given constraints of file size, access privileges, and how you will be manipulating the file’s contents.
Scott Theis is the President of Novaworks and the principal developer of the Legato scripting language. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages.
Additional Resources
Legato Script Developers LinkedIn Group
Primer: An Introduction to Legato