A very common scenario for a script is to get a file from a user, perform several actions on it, and then save the file. As developers, it is easy to fall into the “user is always correct” trap: we assume that if we ask for an HTML file, the user is going to give us one. Whether on purpose or by accident, sometimes that isn’t the case, and we should be prepared for it. This post discusses how we can validate the files we receive from the user.
Since we are using a Windows file system, most of the user’s files will have extensions, which is a good starting point for file type validation. Generally, unless there is a lack of computer skills (or malicious intent), a file’s extension matches its content. To get the extension, simply use Legato’s GetExtension function.
string = GetExtension ( string name );
This function returns the extension of the file, including the leading period. Just adding this quick check can reduce the chance of bad data causing unexpected behavior in your script. Consider a script that takes an HTML file and a CSV file of changes to make to the HTML file. The last thing a developer needs is an upset user who accidentally selected an XLSX document and watched the script ruin their HTML file with bogus edits. Checking the file extension can prevent these kinds of mistakes. However, it doesn’t always help. Certain extensions, like XML, can cover drastically different file contents. An XBRL instance document is nothing like a 13F information table, yet both are XML in format and carry the XML extension.
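As a minimal sketch of the extension check described above, the script below refuses to continue unless the chosen file ends in .htm or .html. The file path is a placeholder, and MessageBox, MakeLowerCase, and the ERROR_NONE/ERROR_EXIT return codes are assumptions about the Legato SDK that this post does not cover, so check them against the documentation before relying on them.

```
int main() {
    string name;
    string ext;

    // Placeholder path; a real script would get this from the user.
    name = "C:\\Work\\report.htm";

    // GetExtension returns the extension with its leading period.
    // MakeLowerCase (assumed SDK function) lets ".HTM" match as well.
    ext = MakeLowerCase(GetExtension(name));

    if ((ext != ".htm") && (ext != ".html")) {
      // MessageBox with an 'x' (error) type is an assumed SDK call.
      MessageBox('x', "%s is not an HTML file.", name);
      return ERROR_EXIT;
      }

    // Safe to continue with the HTML processing here.
    return ERROR_NONE;
    }
```

A check like this looks only at the name, so an .xml file passes no matter which kind of XML it actually contains.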
So what can we do about this? Luckily, with Legato you can have the application test the contents of a file and determine its file type. This test includes narrowing down to specific types of XML if the application understands the XML coding. To do this we have two options:
dword = GetFileTypeCode ( string source, [boolean extensiononly] );
string = GetFileTypeString ( string source, [boolean extensiononly] );
These functions are essentially the same, except that GetFileTypeCode returns a numeric value representing the resulting file type while GetFileTypeString returns a string representing the type. Both functions open the file and read a small amount of data to determine information about it. For example, when called on an XML file, the function looks at the file’s namespaces to determine what kind of XML it is. The extensiononly option skips the content analysis. This can still be useful, since it handles file formats that have multiple extensions, like HTM and HTML. When using the GetFileTypeCode function, there are defines for the different file types. These can be found in the Legato SDK header file (Appendix A of the Legato documentation), but here are some of the common ones:
| Define | Sub-type | Description |
| --- | --- | --- |
| FT_ANSI | | ANSI Format (CB) |
| FT_OEM | | OEM Format (CB) |
| FT_UNICODE | | Unicode Text (CB) |
| FT_ASCII | | ASCII Text 7-bit |
| FT_TEXT | | Text Format (Coding Unknown) |
| FT_HTML | | HTML Native (CB/File Type) |
| FT_RTF | | Rich Text Format (CB) |
| FT_CSS | | Cascading Style Sheet |
| FT_LOG | | Log File (Text) |
| FT_WORD | | Microsoft Word |
| FT_POWERPOINT | | Microsoft PowerPoint |
| FT_PDF | | Portable Document Format |
| FT_WORDPERFECT | | WordPerfect |
| FT_PAGEMAKER | | Adobe PageMaker |
| FT_INDB | | Adobe InDesign Book (INDB) |
| FT_INDD | | Adobe InDesign Document (INDD) |
| FT_IDML | | Adobe InDesign XML (IDML) |
| FT_SEC_MESSAGE | | SEC Acceptance/Suspense Message |
| FT_CSV | | CSV (CB) |
| FT_XML | | XML (non-specific) |
| FT_XSD | | XML Style Data (non-specific) |
| FT_RSS | | Really Simple Syndication XML Data |
| FT_EXCEL | | Microsoft Excel |
| FT_IXBRL | | Inline XBRL File (XHTML) |
| FT_XBRL | | XBRL File Group Member |
| | FT_XBRL_INS | Instance (main) |
| | FT_XBRL_SCH | Schema |
| | FT_XBRL_CAL | Calculation |
| | FT_XBRL_DEF | Definition |
| | FT_XBRL_LAB | Label |
| | FT_XBRL_PRE | Presentation |
| | FT_XBRL_REF | Reference |
| FT_XFR | | XBRL Financial Report (PSG, XDS) |
| FT_XFDL | | XFDL (EDGAR and Sec16 Filing) |
| FT_XML_SECTION_16 | | Section 16 XML (EDGAR) |
| FT_XML_FORM_13F | | Form 13F XML (EDGAR) |
| FT_XML_FORM_13F_TAB | | Form 13F Table XML (EDGAR) |
| FT_XML_FORM_13H | | Form 13H XML (EDGAR) |
| FT_XML_FORM_17A | | Form X-17A-5 XML (EDGAR) |
| FT_XML_FORM_17H | | Form 17H XML (EDGAR) |
| FT_XML_FORM_C | | Form C XML (EDGAR) |
| FT_XML_FORM_CFP | | Form CFPORTAL XML (EDGAR) |
| FT_XML_FORM_D | | Form D XML (EDGAR) |
| FT_XML_FORM_MA | | Form MA XML (EDGAR) |
| FT_XML_FORM_N_CEN | | Form N-CEN XML (EDGAR) |
| FT_XML_FORM_N_MFP | | Form N-MFP XML (EDGAR) |
| FT_XML_FORM_N_MFP1 | | Form N-MFP1 XML (EDGAR) |
| FT_XML_FORM_N_PORT | | Form N-PORT XML (EDGAR) |
| FT_XML_FORM_N_SAR | | Form N-SAR XML (EDGAR) |
| FT_XML_FORM_SDR | | Form SDR XML (EDGAR) |
| FT_XML_FORM_SDR_EXHIBIT | | Form SDR XML (EDGAR Exhibit) |
| | FT_XML_FORM_SDR_EX_A | Exhibit A - Controlling Persons |
| | FT_XML_FORM_SDR_EX_B | Exhibit B - Chief Compliance Off |
| | FT_XML_FORM_SDR_EX_C | Exhibit C - Director Governors |
| | FT_XML_FORM_SDR_EX_G | Exhibit G - Affiliates |
| | FT_XML_FORM_SDR_EX_I | Exhibit I - Service Provider Con |
| | FT_XML_FORM_SDR_EX_T | Exhibit T - Subscriber Information |
| FT_XML_FORM_TA | | Form TA XML (EDGAR, all) |
| FT_XML_EDGAR | | EDGARLink Online (EDGAR XML) |
| | FT_XML_EDGAR_S16 | EDGARLink Online (Section 16 Only) |
| FT_XML_FORM_ABS | | Form ABS XML (EDGAR) |
| | FT_XML_ABS_AUTOLEASE | Auto Lease |
| | FT_XML_ABS_AUTOLOAN | Auto Loan |
| | FT_XML_ABS_CMBS | Commercial Mortgage |
| | FT_XML_ABS_DS | Debt Securities |
| | FT_XML_ABS_RMBS | Residential Mortgage |
| | FT_XML_ABS_NOTES | Disclosure Notes (Ex-103) |
| FT_XML_REG_A | | Regulation XML (EDGAR) |
| FT_NSAR | | NSAR Data (answer.fil) |
| FT_BITMAP | | Bitmap (CB) |
| FT_GIF | | Graphics Interchange Format (CB) |
| FT_JPEG | | JPEG Image Format (CB) |
| FT_PNG | | Portable Network Graphic (CB) |
| FT_ZIP | | Zipped/Compressed |
| FT_GOFILER_PROJECT | | GoFiler Project File (v 1.x & 2.x) |
| FT_GOFILER_PROJECT_3X | | GoFiler Project File (v 3.x) |
| | FT_GFP_3X_ELO | Normal EDGAR Link Online |
| | FT_GFP_3X_13H | Form 13H |
| | FT_GFP_3X_13F | Form 13F |
| | FT_GFP_3X_MA | Form MA |
| | FT_GFP_3X_SDR | Form SDR |
| | FT_GFP_3X_RGA | Regulation A |
| | FT_GFP_3X_17A | Form X-17A-5 |
| | FT_GFP_3X_C | Form C |
| | FT_GFP_3X_CFP | Form CFPORTAL |
| | FT_GFP_3X_17H | Form 17H |
| | FT_GFP_3X_TA | Form TA |
| | FT_GFP_3X_CEN | Form N-CEN |
| | FT_GFP_3X_NPT | Form N-PORT |
| | FT_GFP_3X_S16 | Section 16 (Combined) |
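To tie the defines back to the earlier HTML-plus-CSV scenario, here is a rough sketch that validates both inputs by content before any edits are made. It assumes that an ordinary HTML file is reported as FT_HTML and a comma-separated file as FT_CSV; an Inline XBRL document, for instance, would presumably come back as FT_IXBRL and be rejected by this test. The paths are placeholders, and MessageBox and the ERROR_ return codes are assumed SDK items not described in this post.

```
int main() {
    string html_name;
    string csv_name;

    // Placeholder paths; a real script would get these from the user.
    html_name = "C:\\Work\\report.htm";
    csv_name = "C:\\Work\\changes.csv";

    // Content-based test: the file is opened and examined.
    if (GetFileTypeCode(html_name) != FT_HTML) {
      MessageBox('x', "Expected an HTML file but found %s.",
                 GetFileTypeString(html_name));
      return ERROR_EXIT;
      }

    // Passing TRUE for extensiononly would skip the content analysis;
    // here the contents are checked so a renamed XLSX file is caught.
    if (GetFileTypeCode(csv_name) != FT_CSV) {
      MessageBox('x', "Expected a CSV file but found %s.",
                 GetFileTypeString(csv_name));
      return ERROR_EXIT;
      }

    // Both inputs look right; apply the changes here.
    return ERROR_NONE;
    }
```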
It is important to note that you can also switch between the codes and strings with the following two functions:
dword = FileTypeStringToCode ( string code );
string = FileTypeCodeToString ( dword code );
These functions simply take the type in one format and convert it to the other. This can be useful if you want to be efficient with memory by storing the dword codes but then want the more human-friendly version in a log file.
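The sketch below shows that pattern: the script keeps the dword code for comparisons and converts it to the string form only when writing a log line. AddMessage, used here to write to the log, is an assumed SDK function, as are the return codes, and the path is a placeholder.

```
int main() {
    dword code;
    string text;

    // Keep the compact numeric code for tests and storage.
    code = GetFileTypeCode("C:\\Work\\report.htm");

    // Convert to the readable name only for reporting.
    text = FileTypeCodeToString(code);
    AddMessage("Detected type %s (0x%08X)", text, code);

    // Converting the string back should yield the original code.
    if (FileTypeStringToCode(text) != code) {
      AddMessage("Type string %s did not convert back cleanly", text);
      return ERROR_EXIT;
      }
    return ERROR_NONE;
    }
```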
The last function I want to discuss is a more powerful version of the GetFileTypeCode and GetFileTypeString functions. This is the GetFileTypeData function.
string[] = GetFileTypeData ( string source );
This function works like the other two, but instead of returning a code or string it returns an array of properties about the file. These go beyond size and modified time into the file’s metadata (if the application knows how to read it for that file type). For example, running this function on a GoFiler project would give you the following properties:
FileTypeCode: 0x00007905
FileTypeString: FT_GOFILER_PROJECT_3X
ExtensionTypeCode: 0x00007904
ExtensionTypeString: FT_GOFILER_PROJECT
TypeDescription: GoFiler Project
FilePath: C:\Users\david.theis\Desktop\XBRL Testing\
FileName: test.gfp
FileSize: 6173
FileCreateTime: 2018-04-10T16:13:48
FileModifiedTime: 2013-11-15T16:28:46
MetaAuthor: David Theis
MetaKeywords: 09-30-2012
MetaSubject: 0000990681
MetaTitle: 10-Q
The properties include the type of the file as well as the creator of the project, the report period, the CIK, and the form type. With this information you can show metadata on a dialog about a user’s chosen file to help them verify it was the proper choice. Additionally, for image files you can get the dimensions of the picture.
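As a hedged sketch of that idea, the script below reads the properties for a chosen project and shows a short summary for the user to confirm. It assumes the returned array can be indexed by the property names shown in the sample output (for example "MetaTitle"), which this post does not confirm; MessageBox and the return codes are likewise assumed SDK items, and the path is a placeholder.

```
int main() {
    string name;
    string props[];

    // Placeholder path; a real script would get this from the user.
    name = "C:\\Work\\test.gfp";

    props = GetFileTypeData(name);

    // Assumption: properties are addressable by key name.
    if (props["FileTypeString"] == "") {
      MessageBox('x', "Could not read file information for %s.", name);
      return ERROR_EXIT;
      }

    // Show enough metadata for the user to recognize the file.
    MessageBox('i', "File: %s\rType: %s\rForm: %s\rCIK: %s\rPeriod: %s",
               props["FileName"], props["TypeDescription"],
               props["MetaTitle"], props["MetaSubject"],
               props["MetaKeywords"]);
    return ERROR_NONE;
    }
```

Which Meta properties are present depends on the file type, so a production script should be prepared for any of them to be empty.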
Now that you know how to check the contents of files be sure to use this knowledge to improve your next script. With Legato, checking files is easier than ever.
David Theis has been developing software for Windows operating systems for over fifteen years. He has a Bachelor of Sciences in Computer Science from the Rochester Institute of Technology and co-founded Novaworks in 2006. He is the Vice President of Development and is one of the primary developers of GoFiler, a financial reporting software package designed to create and file EDGAR XML, HTML, and XBRL documents to the U.S. Securities and Exchange Commission.