A few weeks ago we started building a Section 16 Data Scraper with the intention of a user being able to enter a CIK and Legato downloading all of the CIK’s Section 16 filings, putting the data together into a spreadsheet, and then being able to start a Section 16 filing by using that data as an entry point. At the end of the last post we had a script that would go out and download all of a singular CIK’s filings. It would also cache any filings it saw so that we were only downloading each individual filing once. This week we are going to continue onward and code a system to scrape useful data out of the filings and put it together into a CSV file.
Friday, February 08. 2019
LDC #122: Building a Section 16 Data Scraper, Part 2: Aggregation
Section 16 filings are XML files, but unfortunately there is no EDAC for them yet (for an introduction to EDAC, see Steve’s blog post last week). This means in order to scrape out information we have to rely upon parsing through the XML data manually. To do this we are going to use the SGML parser. This is similar to how we have parsed through HTML code in the past, as both HTML and XML are styles of SGML.
As we parse through each document we are going to store a number of values into a datasheet, which we will then write out to a CSV file before we finish. Later this will allow us to get all of the transaction data from a single source rather than having to recompile the data each time the script is run. Using a datasheet for creating a CSV is akin to using a chainsaw to cut through paper, but I use it rather than a CSV object because it allows me to get the ordering correct without having to do a lot of work. In the end we want to store values from newest to oldest, but if you have read part one you may remember that we left our array ordered from oldest files to newest files. If you haven’t read part one I’d recommend doing so. So we are going to use the datasheet object to create a table that we can continuously put in values as the second row from the top, right underneath the header, and the datasheet will move all of the rest of the rows down.
Now let’s take a look at the full text of this weeks script:
/***************************************************************************************************************** Section 16 Data Collector ------------------------- Revision: 01-04-19 JCK Initial creation - get filings from SEC 02-08-19 JCK Part 2 - Store information Notes: -Stage 2 (c) 2019 Novaworks, LLC. All Rights Reserved. *****************************************************************************************************************/ #define TEST_CIK "0001214156" handle hSheet; handle file; boolean IsSection16 (string type); void addTransaction (string issuer); void main() { //creation variables string files[]; //Files from SEC string s16filings[]; //Table for us string cache; //Location of cache string name; //Name of file boolean s16Check; //Placeholder boolean int rc; //Error Checker string c_cik; //CIK of Reporting Owner string base; //String for name storage string parser; //Parse output c_cik = TEST_CIK; //Get the archive list files = EDGARFetchArchiveList(c_cik); rc = GetLastError(); //Check is CIK exists if (rc == (ERROR_REMOTE | ERROR_FILE_NOT_FOUND)) { AddMessage("CIK Does Not Exist"); return; } //If too many filings if (rc == ERROR_OVERFLOW) { files = EDGARFetchArchiveList(c_cik, FALSE, 0, 0, TRUE); } //Initialize needed variables int count; int numfiles; int max; max = ArrayGetAxisDepth(files); count = 0; numfiles = 0; //Find the location for our cache cache = GetScriptFolder() + "\\Cache\\" + c_cik; CreateFolders(cache); if (IsFolder(cache) == FALSE) { AddMessage('x', "Unable to create document cache."); return; } //Open progress bar ProgressOpen("Getting Files"); ProgressSetPhaseCount(max-1); //While there are still files to go through while (count < max) { //Update progress ProgressSetPhase(count); ProgressSetStatus(FormatString("Getting file %d of %d", count, max)); ProgressUpdate(count, max); //Get filename base = ReplaceInString(GetFilename(files[count]), ".txt", ".xml"); //Check if already downloaded if (DoesFileExist(AddPaths(cache, base)) == FALSE) { //Open archive file = EDGARArchiveOpen(files[count]); //Make sure a S16 filing s16Check = IsSection16(EDGARArchiveGetDocType(file, 0)); //If s16 if (s16Check) { //Set name of XML file name = AddPaths(cache, EDGARArchiveGetProperty(file, "accession_number") + ".xml"); //Get XML file and save it locally EDGARArchiveGetDocFile(file, 0, name); } //Add to the list s16filings[numfiles++] = name; CloseHandle(file); } else { //Add cached file to list s16filings[numfiles++] = AddPaths(cache, base); } count++; } //Close progress ProgressClose(); //Get number of filings count = ArrayGetAxisDepth(s16filings) - 1; //Create data sheet hSheet = DataSheetCreate(2, 8); //Set the header row DataSheetSetCellText(hSheet, 0, 0, "Acquisition or Disposition"); DataSheetSetCellText(hSheet, 0, 1, "Transaction Date"); DataSheetSetCellText(hSheet, 0, 2, "Transaction Type"); DataSheetSetCellText(hSheet, 0, 3, "Direct or Indirect Ownership"); DataSheetSetCellText(hSheet, 0, 4, "Number of Securities Transacted"); DataSheetSetCellText(hSheet, 0, 5, "Number of Securities Owned"); DataSheetSetCellText(hSheet, 0, 6, "Issuer CIK"); DataSheetSetCellText(hSheet, 0, 7, "Security Name"); //Create SGML parser file = SGMLCreate(); //Go through filings while (count >= 0) { AddMessage("Parsing file %s", s16filings[count]); //Set parse to this file rc = SGMLSetFile(file, s16filings[count]); //Check for errors if (rc != ERROR_NONE) { AddMessage("Error on Setting File: 0x%08X", rc); } //Get first element parser = SGMLNextElement(file); rc = GetLastError(); //Check for issuer CIK while (parser != "<issuerCik>" && rc == ERROR_NONE) { parser = SGMLNextElement(file); rc = GetLastError(); } //Get CIK c_cik = SGMLNextItem(file); if (rc != ERROR_NONE) { AddMessage("Error on retrieving CIK: 0x%08X", rc); } //Until end of file while (rc == ERROR_NONE) { //check if element is beginning of transaction if (parser == "<nonDerivativeTransaction>" || parser == "<derivativeTransaction>") { //Add transaction addTransaction(c_cik); } //Next element parser = SGMLNextElement(file); rc = GetLastError(); } //Next file count--; } //Done with SGML parser CloseHandle(file); //Export data to CSV rc = DataSheetExport(hSheet, AddPaths(cache, c_cik + ".csv")); if (rc != ERROR_NONE) { AddMessage("Error on Creating CSV: 0x%08X", rc); } CloseHandle(hSheet); AddMessage("Finished Parsing files from CIK: %s", TEST_CIK); } //Returns true if S16 file type, false if anything else boolean IsSection16(string type) { switch (type) { case "3": return true; case "4": return true; case "5": return true; case "3/A": return true; case "4/A": return true; case "5/A": return true; } return false; } void addTransaction(string issuer) { string aord, date, ttype, dori, numtrans, numowned, secname; string parser; int rc; //Initialize variables aord = ""; date = ""; ttype = ""; dori = ""; numtrans = ""; numowned = ""; secname = ""; //Get next element parser = SGMLNextElement(file); //Check for end of the transaction while (parser != "</nonDerivativeTransaction>" && parser != "</derivativeTransaction>" && rc == ERROR_NONE){ if (parser == "<securityTitle>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } secname = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ secname += SGMLNextItem(file); } secname = ReplaceInString(secname, "</value>", ""); } if (parser == "<transactionDate>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } date = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ date += SGMLNextItem(file); } date = ReplaceInString(date, "</value>", ""); } if (parser == "<transactionCode>"){ ttype = SGMLNextItem(file); } if (parser == "<transactionShares>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } numtrans = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ numtrans += SGMLNextItem(file); } numtrans = ReplaceInString(numtrans, "</value>", ""); } if (parser == "<transactionAcquiredDisposedCode>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } aord = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ aord += SGMLNextItem(file); } aord = ReplaceInString(aord, "</value>", ""); } if (parser == "<sharesOwnedFollowingTransaction>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } numowned = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ numowned += SGMLNextItem(file); } numowned = ReplaceInString(numowned, "</value>", ""); } if (parser == "<directOrIndirectOwnership>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } dori = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ dori += SGMLNextItem(file); } dori = ReplaceInString(dori, "</value>", ""); } parser = SGMLNextElement(file); rc = GetLastError(); } //Insert a row underneath the header row DataSheetRowInsert(hSheet, 1); //Insert all of the data DataSheetSetCellText(hSheet, 1, 0, aord); DataSheetSetCellText(hSheet, 1, 1, date); DataSheetSetCellText(hSheet, 1, 2, ttype); DataSheetSetCellText(hSheet, 1, 3, dori); DataSheetSetCellText(hSheet, 1, 4, numtrans); DataSheetSetCellText(hSheet, 1, 5, numowned); DataSheetSetCellText(hSheet, 1, 6, issuer); DataSheetSetCellText(hSheet, 1, 7, secname); }
It may be long, but a lot of it is simple, and there are many places where we take several lines of code to fill out different values in our datasheet. Let’s break down what we have added for this week. We add a couple of global variables, our parser handle and the currently open file handle. We will use those in our new function and with no easy way to pass by reference we just make them global. We also declare our new function: addTransaction. We pass the Issuer CIK to that function and at the end we will store the data from the transaction, including the issuer CIK, into our datasheet. The first half of our main function, where we get all the files for a CIK, is the same. The changes start to come right before we enter the loop through our array of file names.
//Create data sheet hSheet = DataSheetCreate(2, 8); //Set the header row DataSheetSetCellText(hSheet, 0, 0, "Acquisition or Disposition"); DataSheetSetCellText(hSheet, 0, 1, "Transaction Date"); DataSheetSetCellText(hSheet, 0, 2, "Transaction Type"); DataSheetSetCellText(hSheet, 0, 3, "Direct or Indirect Ownership"); DataSheetSetCellText(hSheet, 0, 4, "Number of Securities Transacted"); DataSheetSetCellText(hSheet, 0, 5, "Number of Securities Owned"); DataSheetSetCellText(hSheet, 0, 6, "Issuer CIK"); DataSheetSetCellText(hSheet, 0, 7, "Security Name"); //Create SGML parser file = SGMLCreate();
We create a datasheet with two rows and eight columns, and then we set the first row to be the headers of the sheet. They define the values that we will put in those columns later. We then also create our SGML parser. We can use the same SGML parser for every single file as the currently loaded file can be changed within the same object. This means instead of creating a large number of SGML parse objects, we can create just one, which is a much cheaper operation.
//Go through filings while (count >= 0) { AddMessage("Parsing file %s", s16filings[count]); //Set parse to this file rc = SGMLSetFile(file, s16filings[count]); //Check for errors if (rc != ERROR_NONE) { AddMessage("Error on Setting File: 0x%08X", rc); } //Get first element parser = SGMLNextElement(file); rc = GetLastError(); //Check for issuer CIK while (parser != "<issuerCik>" && rc == ERROR_NONE) { parser = SGMLNextElement(file); rc = GetLastError(); } //Get CIK c_cik = SGMLNextItem(file); if (rc != ERROR_NONE) { AddMessage("Error on retrieving CIK: 0x%08X", rc); }
Now we start going through each file. For each file we set the SGML parser to that file. This resets the positioning of the parser to the beginning of that file. After each major operation we are going to do a quick error check to make sure that we are not encountering any errors. Next we get the first element of the file using SGMLNextElement(). We also get the error code from that operation. Then we enter a while loop which we will use until we come across the issuerCik element. The SGML parser supports HTML DTDs, so if we were using a DTD and parsing HTML we could check to see if the element matched a DTD element, but since XML has no defined DTD we are forced to check the string name elements complete with the brackets on the ends. Our while loop goes until we see issuerCik or an error is encountered. Each time through the while loop we get the next element in the file and the error code from that operation. We constantly check for errors because most of the time errors that we encounter while parsing through a file means that we have reached the end of the file. This is because our source documents are from the SEC’s website where the XML has already been checked for errors. If we were not checking to see if we were in an error state we would get ourselves into an infinite loop. Once we find the issuerCik element we get the next item and store it as the CIK.
//Until end of file while (rc == ERROR_NONE) { //check if element is beginning of transaction if (parser == "<nonDerivativeTransaction>" || parser == "<derivativeTransaction>") { //Add transaction addTransaction(c_cik); } //Next element parser = SGMLNextElement(file); rc = GetLastError(); } //Next file count--; } //Done with SGML parser CloseHandle(file); //Export data to CSV rc = DataSheetExport(hSheet, AddPaths(cache, c_cik + ".csv")); if (rc != ERROR_NONE) { AddMessage("Error on Creating CSV: 0x%08X", rc); } CloseHandle(hSheet); AddMessage("Finished Parsing files from CIK: %s", TEST_CIK); }
Once we have the issuer CIK we can go until the end of the file. We enter a while loop and check to see if we are in an error state. If we are not, we check to see if the element that we are on is the beginning of a transaction element. If so, we enter our addTransaction function, which we will go through in a bit. Whether or not we are at the beginning of a transaction, the next thing that we do here is to get the next element and error code and then go through the while loop again. We continue going through the loop until we reach an error state, which means that we have reached the end of the file. At that point we go on to the next file and repeat the process.
After all of the files have been gone through we close the handle on our SGML parser and we export the datasheet object to a CSV file in our cache folder. At that point we have reached the end of our script. That means we now need to take a look at the bulk of the work being done: the addTransaction function.
void addTransaction(string issuer) { string aord, date, ttype, dori, numtrans, numowned, secname; string parser; int rc; //Initialize variables aord = ""; date = ""; ttype = ""; dori = ""; numtrans = ""; numowned = ""; secname = ""; //Get next element parser = SGMLNextElement(file); //Check for end of the transaction while (parser != "</nonDerivativeTransaction>" && parser != "</derivativeTransaction>" && rc == ERROR_NONE){ if (parser == "<securityTitle>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } secname = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ secname += SGMLNextItem(file); } secname = ReplaceInString(secname, "</value>", ""); }
The first thing that the function does is sets a whole bunch of string variables to be blank. We are going to reference all of these strings later when we add our transaction row to the datasheet, but there is no guarantee that all of the values are going to be found (they should, but we want to make sure we do not accidentally introduce errors), so we make sure that the values are at least set to something. We then get the next element from our SGML parser, which is global so it is the same file and location that just sent us into this function. This means we know for sure that we are at the beginning of a transaction. We then check to make sure we are not at the end of the transaction or in an error state. We then enter a bunch of if statements. I’m highlighting this one here, but this is the same logic for all of the rest of the statements in this function. We check to see if the element is the beginning of a value we want to store, in this case the security title. The way the XML looks in this case is
<securityTitle> <value>Restricted Stock Unit</value> </securityTitle>
We search for “securityTitle” because we want to then store what is in the value element. Once we hit “securityTitle” we continue onwards until we hit “value”. At that point we start capturing the next item until the next item is a tag. At this point then our secname variable is “Restricted Stock Unit</value>". We then remove “</value>" from the end of the string, and the string has just the value in it now. I skipped error checking in this portion of the loop because all of these files are from the SEC and therefore have gone through validation. Once we hit the fact that “securityTitle” exists we can just go until “value” and “/securityTitle” because the file would not have passed the SEC’s validation otherwise. We do this same logic loop for the rest of the variables declared at the beginning of the function.
if (parser == "<transactionDate>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } date = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ date += SGMLNextItem(file); } date = ReplaceInString(date, "</value>", ""); } if (parser == "<transactionCode>"){ ttype = SGMLNextItem(file); } if (parser == "<transactionShares>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } numtrans = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ numtrans += SGMLNextItem(file); } numtrans = ReplaceInString(numtrans, "</value>", ""); } if (parser == "<transactionAcquiredDisposedCode>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } aord = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ aord += SGMLNextItem(file); } aord = ReplaceInString(aord, "</value>", ""); } if (parser == "<sharesOwnedFollowingTransaction>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } numowned = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ numowned += SGMLNextItem(file); } numowned = ReplaceInString(numowned, "</value>", ""); } if (parser == "<directOrIndirectOwnership>"){ while (parser != "<value>") { parser = SGMLNextItem(file); } dori = SGMLNextItem(file); while (SGMLGetItemType(file) != SPI_TYPE_TAG){ dori += SGMLNextItem(file); } dori = ReplaceInString(dori, "</value>", ""); } parser = SGMLNextElement(file); rc = GetLastError(); }
The only exception is transaction code, which has the value not stored in a <value> tag, but instead is just in the main tag. This simplifies our logic loop greatly as we can just get the next item; we also do not have to care about getting multiple items because we know that the transaction code is always a single letter.
Once we check all of the possible values we get the next element and continue our while loop.
//Insert a row underneath the header row DataSheetRowInsert(hSheet, 1); //Insert all of the data DataSheetSetCellText(hSheet, 1, 0, aord); DataSheetSetCellText(hSheet, 1, 1, date); DataSheetSetCellText(hSheet, 1, 2, ttype); DataSheetSetCellText(hSheet, 1, 3, dori); DataSheetSetCellText(hSheet, 1, 4, numtrans); DataSheetSetCellText(hSheet, 1, 5, numowned); DataSheetSetCellText(hSheet, 1, 6, issuer); DataSheetSetCellText(hSheet, 1, 7, secname); }
Once we reach the end of the transaction and break out of the loop it is time to store the values we just scraped. We insert a row into the first position of the sheet (of a zero based position), which means that we are inserting the second row. Inserting a row pushes down all of the existing rows, meaning that when we look at the CSV after execution the transactions will be new to old. We set all of the cell text to match the header row that we created before we started parsing files and then we exit the function. Since there is no failure or success in this function we leave the return type to be void and we just exit the function naturally by reaching the end of the function.
Now when we run the script we are left with a number of files in our cache folder: all of the Section 16 filings done by a CIK and then a CSV file that contains all of the individual transactions that are reported within these fillings. The next step will be transforming this process into something that a user can control, so in the next installment we will be creating a user interface so that we no longer have to hard code our CIK values into the script itself.
Joshua Kwiatkowski is a developer at Novaworks, primarily working on Novaworks’ cloud-based solution, GoFiler Online. He is a graduate of the Rochester Institute of Technology with a Bachelor of Science degree in Game Design and Development. He has been with the company since 2013. |
Additional Resources