Comma delimited data is perhaps the simplest way of representing a table of information. Many applications that export data will produce a CSV file, allowing for simple and fast extraction of data. However, despite the fact that CSV is a common format, it is not standardized. Reading and writing may also seem simple, but there are sticking points, such as how to handle commas and line returns that occur in the data itself. Wouldn’t it be nice to have tools to deal with CSV data? Legato does.
Friday, January 05. 2018
LDC #66: CSV - The Simplest Data Format
Within this article we will discuss:
– What is the CSV format
– Reading and writing field level data using a CSV Object
– Ad hoc low level CSV functions
– Reading and writing string tables as CSV
– Data Sheet functions
– Processing CSV data in conjunction with the clipboard
What is CSV?
Comma Separated Values (CSV, also commonly known as comma delimited lists) is a de facto standard to exchange simple data tables in plain-text form. Plain-text means that the file is a sequence of characters with no binary data or codes to be interpreted. Instead each line of text is a table row or a record.
The format is meant to convey rows and columns in the form of lines and fields. It’s named “comma-separated” because each field is separated by comma. If a field’s text contains a comma, the data within the field is enclosed as a quoted string. For example, fields:
1 A String text with a comma, 0.00
would be represented as:
1,A String,"text with a comma,",0.00
or
"1","A String","text with a comma,","0.00"
Text is not necessarily encoded except for enclosure within quotes for fields with a comma or if the text has a double quote. Double quotes must be escaped with an additional quote. For example:
Mike said, "This is my string"
This is represented as:
"Mike said, ""This is my string"""
A CSV file consists of any number of records separated by line breaks. However, line breaks can be inside a quoted field. Each record consists of fields separated by comma. Usually, all records have an identical number and sequence of fields. While there are no standard provisions for field names, data types or row characteristics, it is common for the first row to be the field names for the table.
There are variations in CSV as well as compatibility issues. For example, Excel allows for the embedding of return characters (0x0D) within a quoted field, and this counts as return code in the field data, not a new record.
Finally, the character set can be set by some applications. Excel will allow CSV export using UTF-8. If the original file contained Unicode characters and UTF-8 is not employed, then those characters that cannot be converted are translated to question marks (“?”). Note that as of the date of this article, non-ANSI CSV is not supported in Legato. That is expected to be released later in 2018.
CSV Functions
The Legato SDK has about two dozen CSV related functions, many of which organized with the prefix CSV. CSV processing is also part of certain objects such as Data Sheet. In addition, CSV is supported by the application for import, export and display using Data View or via HTML table functions.
Legato CSV SDK functions operate in two distinct manners: as a group working on a CSV Object or as standalone functions operating on discrete bits of data. When working with a source that may contain line breaks within fields, you should avoid handling data by reading lines from a source file and then processing them as records since any field with a return character will break the flow and end that record. Rather, process the entire stream at once or ‘set’ the entirety of data into a CSV Object.
The CSV Object
Legato’s CSV object allows for control of input and output of CSV data. The object is created using the CSVCreate function:
handle = CSVCreate ( [string data] );
On success, a handle is returned to a CSV Object. Optionally, supplying the data parameter allows information to be set automatically into the input buffer. To create CSV, omit the data parameter.
Once an object is created, you can still add initial data to it using the CSVSetRecord function:
int = CSVSetRecord ( handle hCSV, string data );
Upon execution, all the internal controls of the CSV Object are reset. The data parameter operates in the same manner as when used with the CSVCreate function.
When a function or script has finished using a CSV Object, it can close the handle or let it be automatically closed on function exit if locally declared. The CSVReset function can also be used to quickly clear an object.
A CSV Object can be used simultaneously for input and output.
Reading Data Using an Object
There are a series of Legato functions that read data from a CSV Object. First, an entire row can be read with the multifaceted CSVGetFields function:
string[] = CSVGetFields ( handle hCSV );
The function returns all fields as text within the current record specified the hCSV object handle. individual fields can be read by the CSVGetNumericField and CSVGetStringField functions.
long = CSVGetNumericField ( handle hCSV );
string = CSVGetStringField ( handle hCSV );
Numeric values are interpreted by passing the string to the equivalent of the TextToInteger function. The numeric value can be decimal, octal (leading zero), or hexadecimal. The return code from the GetLastError function can differentiate between errors, zero fields, and the end of the record. For maximum flexibility, the return data type is long (64-bit signed) and can be set to an int type without a run time warning so long as the value does not overflow. The string version obviously returns a string. Again, the GetLastError or IsError functions can be used to determine if the data has been exhausted or if there was an error should an empty string be returned.
Finally, there are two functions supplied that can be used to determine if data within the object has been depleted: the CSVIsLastField and CSVIsLastRecord functions. The CSVIsLastField function can preemptively check if the source fields have been exhausted prior to calling the CSVGetStringField or CSVGetNumericField functions. The function considers the end of a row (line) as a last field.
boolean = CSVIsLastField ( handle hCSV );
Similarly, the CSVIsLastRecord function checks to see if all data has been depleted:
boolean = CSVIsLastRecord ( handle hCSV );
Note that if the handle is invalid, all of these functions will set the last error condition to indicate this. Let’s look at an example using low level functions:
handle hCSV; string s1; int rx, cx; s1 = "(insert your sample filename and path)"; s1 = FileToString(s1); if (IsError(s1)) { MessageBox('x', "Error %08X opening source\r\r", GetLastError(), s1); exit; } hCSV = CSVCreate(s1); while (CSVIsLastRecord(hCSV) == FALSE) { rx++; AddMessage("Row %d", rx); cx = 1; while (CSVIsLastField(hCSV) == FALSE) { s1 = CSVGetStringField(hCSV); AddMessage(" %d : %s", cx, s1); } CSVGetStringField(hCSV); }
In the above code, we load the contents of a file into a string and then create a CSV object using that string. From that point we enter a loop that first tests whether there is remaining data and then loops through each field. Note that on exit of the field loop, we get one last field to force the object to trip to the next record. The above example allows a program to walk through the CSV data and process it as required.
Finally, there is an SDK function that counts the number of fields in the current record:
int = CSVGetFieldCount ( handle hCSV );
The return value is number number of fields (or a formatted error code).
Creating CSV Using an Object
The CSV object has two functions to allow data fields to be appended to a buffer and then written to a Basic File Object. The append function is as follows:
int = CSVAppendField ( handle hCSV, mixed data );
The return value indicates whether an error occurred such as invalid handle or overflow. The data parameter can be a string or an integer based value. Upon execution, the function will format the data as required and then append the field to the object’s output buffer.
When a record has been completed, the write record function can be called:
int = CSVWriteRecord ( handle hCSV, handle hTarget );
Data in the output buffer is flushed to the file with a delimiting line return. hTarget can be acquired by using the OpenFile or CreateFile functions.
Generally, the above functions allow you to incrementally build records and add them to a file. There are simpler and more efficient methods to perform the same operation that we will discuss later.
Low Level Stand Alone CSV Functions
CSV support is also provided on an ad hoc record-level basis not requiring a CSV Object. Note that that some functions we already covered, but in this mode they operate without a CSV Object parameter. These take information to and from arrays to read or create comma delimited data. The first function takes an array of strings and creates a CSV record:
string = CSVArrayToString ( string data[], [int fields] );
Data is processed and returned as a string after each array element of data processed. An optional fields parameter can be provided to limit the number of fields to add or pad out any empty fields should the data array be too short. The resulting string can then be appended to a pool, written to a file, or otherwise processed as required.
The reciprocal function, the CSVGetFields function, breaks apart a CSV string into a series of array elements:
string[] = CSVGetFields ( string data );
The field count function also operates on a standalone string:
int = CSVGetFieldCount ( string source );
Read or Write an Entire Table
In most cases it’s easiest simply to read or write an entire CSV file into or from a string table. Two functions in the API do just that: CSVReadTable and CSVWriteTable. The CSVReadTable function appears below.
string[][] = CSVReadTable ( handle hFile | string name );
The source can be a handle to a Basic File Object or just a simple filename. Data is returned as a simple string table in the form of rows and columns. There are a couple of advantages to using the CSVReadTable function. The first of which is the data can be randomly accessed. Secondly, the function operates fairly quickly. As always, to determine if an error occurred, use the IsError and/or GetLastError function.
Once loaded, the data can be managed like a regular string table.
A string table can also be directly written as CSV:
int = CSVWriteTable (string table[][],
string filename | handle hFile, [boolean endings] );
The function writes the table array (two-dimensional) to a file by filename or as a Basic File Object as specified by hFile. An optional endings flag can be added to change how line endings are added. The return value is a formatted error code.
Data Sheet
Another method of handling CSV data, which may be overly sophisticated for some circumstances, is to use a Data Sheet Object:
handle = DataSheetOpen ( string filename );
The DataSheetOpen function examines the file extension of the filename parameter. If it’s “.csv”, it will import the CSV data into the sheet. It can then be accessed and managed using Data Sheet functions. This is a large subject which we will cover in a later article.
CSV and the Clipboard
Another interesting option is using the Windows clipboard to move CSV. Consider the following functions:
string[][] = ClipboardGetCSVData ( [handle hClipboard] );
This first function will take CSV data directly from the clipboard and return a string table. This is very convenient if one would like to grab some information from an application such as Excel, process the data, and perhaps create a formatted HTML table.
string = ClipboardGetCSVText ( [handle hClipboard] );
Similar to getting a table, the above will read the raw CSV data from the clipboard in the form of a string. That can then be used with other functions such as the CSVCreate function.
You can also go the other way:
int = ClipboardSetCSV ( handle hClipboard, string data | string table[][] | handle hPool );
Using this function, CSV data can be posted directly to the clipboard.
Other CSV Venues
Finally, there are non-Legato CSV processing options in GoFiler that we have not covered. These include built-in application level translate functions to load CSV into formatted text or HTML and capturing CSV from HTML tables. Some can be directly accessed via Legato using the ConvertFile function.
Once data is an array or table, it can also be used with ODBC (SQL), dialog controls, and many other functions to achieve whatever processing you need.
Conclusion
There are a lot of methods to transport general data. CSV may not be the most elegant, but it is simple and common. Whether you have a one-off project to read a bunch of data or need a utility to read and process CSV data, Legato has you covered with good options and a variety of tools to help you get it done.
Scott Theis is the President of Novaworks and the principal developer of the Legato scripting language. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages. |
Additional Resources
Legato Script Developers LinkedIn Group
Primer: An Introduction to Legato