Frequent-Check Machines

People use us because they want to keep up

Trounceflow takes care of the big problem of getting - fast - the data that reveals demand. That's why people use us: because we help them keep up.

It's hard to keep up when data is Quiet

One of the reasons why it is difficult to keep up yourself, without Trounceflow, is that in most cases this data is what we call Quiet. What that means is that there is no notification - no "push" - from the data generator (the source of the data) to the rest of the world when the data has been updated, so you have to keep checking the source yourself, and when it has been updated, you have to "pull" the data in.

We have invented frequent-check machines

Having someone on the team check frequently would mean you would be fast - there would be only a short lag between the time that the data source was updated and when you send it to clients - but paying people to check frequently could become expensive. The good news is that we have invented several types of frequent-check machines, and we have built many of each type. They have an initial build cost, but their running costs are low, and so we are spending money on building more because in the long run it is cost effective.

Our simple "Page Observer" checking machines

The simpler of our machines are our "Page Observers". All these do is examine a webpage (their input is the page URL) and tell you whether there has been a change of any kind on the webpage (their output is a true or false answer to the question: has the web page changed?). There are two techniques they make use of: (1) hashing functions, like MD5, and (2) ETags. While these machines don't tell you whether the data you are interested in has been updated - there may be many things on the webpage that could change apart from the data you are interested in - they do at least tell you not to bother having a person check the webpage when they report no change to the page.
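
As a sketch of the technique (not our production code), a Page Observer along these lines can be written in a few lines of Python using the requests library: it fetches the page, records the ETag response header (when the server provides one) and an MD5 hash of the page body, and compares them with the values saved from the previous check. The function names here are illustrative.

```python
import hashlib
import requests

def page_fingerprint(url):
    """Fetch a page and return its two change signals: the ETag header
    (if the server sends one) and an MD5 hash of the page body."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return {
        "etag": response.headers.get("ETag"),            # may be None
        "md5": hashlib.md5(response.content).hexdigest(),
    }

def has_page_changed(url, previous):
    """Answer the Page Observer question: has the web page changed?
    'previous' is the fingerprint saved from the last check (a Page State)."""
    current = page_fingerprint(url)
    if previous.get("etag") and current["etag"]:
        return current["etag"] != previous["etag"]
    return current["md5"] != previous["md5"]
```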

Our powerful "Crawler" checking machines

The more powerful of our machines are our "Crawlers". They are not looking for change. They are there to ingest the latest data. If the data has not been updated, then the latest data ingested now will be the same as the latest data ingested last time there was a "crawl".

Output Example: Foreign Holders of Malaysia

One of our outputs to clients is time-series data on the foreign holdings of local-currency-denominated government bonds. In the case of Malaysia, this data is shown on the Trounceflow Malaysia Country Page, and we have sourced it from a spreadsheet, updated at monthly frequency, that is to be found on a Malaysian government website.

Input Source: Malaysian government website

The Malaysian government website is that of the central bank, Bank Negara Malaysia (BNM). What we call the Base URL of that website is https://www.bnm.gov.my, and then there are section URLs; in our case we are looking for https://www.bnm.gov.my/-/monthly-highlights-and-statistics-in-october-2021.

Input File: Spreadsheet

On that webpage there are lots of .xls files, and we want a particular one, called "3.2 RENTAS: Foreign Holdings in Debt Securities and Sukuk".
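
To illustrate what a crawler has to do at this point, here is a minimal sketch (not our production spider) that fetches the BNM page and looks for the link to that spreadsheet, using the Requests and Beautiful Soup libraries described in the next section. The assumption that the link text contains "RENTAS" is ours, for illustration only.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

PAGE_URL = "https://www.bnm.gov.my/-/monthly-highlights-and-statistics-in-october-2021"

def find_rentas_spreadsheet(page_url=PAGE_URL):
    """Return the absolute URL of the '3.2 RENTAS' .xls file linked from
    the page, or None if no matching link is found."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=True):
        text = link.get_text(strip=True)
        if "RENTAS" in text and link["href"].lower().endswith((".xls", ".xlsx")):
            return urljoin(page_url, link["href"])
    return None
```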

Web spiders to web crawl

Web Spiders is the name we give to a type of tool we build. This type of tool performs a task called web crawling. Crawling is the technical term for accessing a website and obtaining information via a software program. We build web spiders to save time checking whether the information on websites has changed, because if it has, our clients want to know about it sooner rather than later, so we help pipe the new data through as fast as possible.

There is a Spiders README that describes the technical details of spider development.
The complete list of spiders can be seen here in Django Admin, and you can see whether they are flagged as 'active' or not, and of what type (country, fund or '-') they are.

We have coded a web page on the App called 'Spider Management' to allow the Trounceflow Team to view the status of the active managed spiders. Active managed spiders are those that are included in the daily spider run. Spider runs can be successful, or they can fail.

We have not built as many spiders as we would like to, so we still pay people to obtain some data from websites. We don't hack any websites, or go through firewalls. We don't need to: the kind of data we look for is data that providers want us to find. Some spiders are just a few lines of code, some a few hundred.
There are two types of information we use spiders to obtain:

  1. Whether the page has changed;
  2. The most recent version of a file or information on the page.

Thus there are two types of spiders:

  1. Page Observers are our spiders that get the first type of information. They are very simple programs that use hashing functions (MD5) or ETags to detect change. They save states of observed pages as Page States.
  2. Managed Spiders get the second type of information. They need more work to develop, but they are still simple as they use pre-existing programs (libraries) to help:
    - Requests allows you to easily send HTTP/1.1 requests;
    - Selenium automates web browsers and is more powerful;
    - Beautiful Soup extracts data from HTML and XML files.
    Managed Spiders often get files, especially PDF or XLS files.
    They also store the files in Amazon S3; a minimal sketch of such a spider follows this list.
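
A minimal sketch of that last step, assuming a hypothetical bucket name and key and using the boto3 library for Amazon S3 (our real spiders do more than this):

```python
import boto3
import requests

def store_latest_file(file_url, bucket="trounceflow-spider-files",
                      key="malaysia/rentas-foreign-holdings.xls"):
    """Download the latest version of a source file and store it in Amazon S3.
    The bucket and key names here are placeholders, not our real ones."""
    response = requests.get(file_url, timeout=60)
    response.raise_for_status()

    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=response.content)
    return key
```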

SITUATION

There are more than 300 'active' spiders at Trounceflow. An 'active' spider is one that is included in the daily spider run.
Spider runs can succeed or fail.
A spider run can fail for different reasons. For example:

  • because the spider's source URL has changed, so a spider that was developed to get the file from the previous URL needs a fix to work with the new source URL;
  • because the content of the source page has changed (even when the URL is the same), so the spider needs a fix to work with the updated content of the source page;
  • because there can be updates to the libraries and dependencies, so the spider needs corresponding edits to keep its code from becoming obsolete.
Monitoring spider runs and fixing failed ones is one of the developers' constant tasks. We are always working on improvements.

Task Scheduling Machines

Background to Messaging and Tasks

In one case, a page hashing function can determine whether there has been a change. In another, a check of the ETag of a page will identify whether there has been a change.

Then, in that case, a Trounceflow manager could send an SMS (text) message, WhatsApp message, Slack message or email message to a Trounceflow employee to ask them to check the web page concerned in some periodic way, for example *every six hours, starting on the 4th day of the month at 05:59* (thus thereafter at 11:59, 17:59 and 23:59), and to stop checking only when the employee detected a specific change in the web page, such as the appearance of (new files containing) new information.

Suppose that the information in question (being checked for updates) concerns the types of holders of Malaysian sovereign bonds at month-end, which is a multivariate time-series (a vector of values) that is released with a variable lag of about a week after the month-end.

Then - once the task is completed (the employee, say, discovers at 6pm on the 6th day of September an updated .pdf file has been uploaded to the Malaysian government website with the vector of holders for 31 August) - the task need not be restarted until, say, the 4th day of the following month.

Another way of getting this work done (checking for updated information on a web page) is to (i) build a checking tool and (ii) have the checking tool visit the web page autonomously.

Visiting and Checking tools (Page Observers)

Celery is a Python library, a task queue, that allows scheduled tasks to be run asynchronously. It receives tasks with their related data, runs them and delivers the results.

Scheduling is the process of assigning resources to perform tasks. For example, assigning a worker to perform a periodic task, like web crawling (setting web spiders to work) every ten minutes, running spiders and parsers, importing from Bloomberg, importing from XLS/CSV files, etc.

Celery Beat is the part of Celery for scheduling periodic tasks. It can be controlled from Django, hence we talk about Django-Celery-Beat.

Django Admin at Trounceflow has a Periodic Tasks area that is directly related to Celery Beat scheduling. It can be opened via this link.

Analysts/editors/administrators can control the frequency and timing of the scheduling without touching any code:
- Interval Schedule - for simple scheduling, e.g. do it every ten minutes;
- Crontab Schedule - for more complex scheduling, e.g. do it at 8pm every day.
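
Under the hood, these two kinds of schedule correspond to Celery beat entries. The sketch below is illustrative only - the app name, task names and timings are made up, not our real configuration - and uses Celery's standard crontab helper:

```python
from celery import Celery
from celery.schedules import crontab

app = Celery("trounceflow")  # illustrative app name

app.conf.beat_schedule = {
    # Interval Schedule equivalent: run every ten minutes (600 seconds).
    "observe-pages-every-ten-minutes": {
        "task": "spiders.tasks.observe_pages",      # hypothetical task name
        "schedule": 600.0,
    },
    # Crontab Schedule equivalent: run every day at 8pm.
    "daily-spider-run": {
        "task": "spiders.tasks.run_daily_spiders",  # hypothetical task name
        "schedule": crontab(hour=20, minute=0),
    },
}
```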

The recommended message brokers for Celery are:
- RabbitMQ, which we use (RabbitMQ is a Message Queue (MQ). It receives messages and delivers messages);
- Redis (we don't use it here but we use it for caching the website).

Message Queues help applications communicate with each other. They enable them to exchange information: 'messages' (packets of data that applications create for other applications to consume).

Messaging is 'asynchronous' ('in the background'): message queues store messages in the order they are transmitted until the consuming application can process them (as opposed to 'synchronous', i.e. 'wait until ready', messaging).

Message Brokers also help applications communicate with each other and exchange information, even if they were written in different languages or on different platforms.
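
To show what 'asynchronous' looks like from the application's side, here is a small sketch with Celery and a RabbitMQ broker (the task, broker URL and page URL are placeholders): the producing application hands the message to the queue and carries on, and a worker consumes it whenever it is ready.

```python
from celery import Celery

# RabbitMQ is the broker: it stores the messages until a worker can process them.
app = Celery("tasks", broker="amqp://guest:guest@localhost:5672//")

@app.task
def crawl_page(url):
    """Runs later, on whichever Celery worker picks the message up."""
    print(f"Crawling {url} ...")

# The producer does not wait for the result ('in the background'):
# .delay() just puts a message on the queue and returns immediately.
crawl_page.delay("https://www.bnm.gov.my")
```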

Parsers to extract information

Parsing is the term for automatically extracting information (and loading it into a database).
Parsers is the name we give to a type of tool we build to implement this process.

We can, and do, pay people to manually parse information. But it can be worth developing a parser to do that repetitive job.
Trounceflow's software developers have written hundreds of parsers to obtain data from hundreds of sources. The source of the information can be a PDF document, an Excel file, a Web Page, or something else.

Parsers are written from scratch in Python, but incorporate pre-existing technologies, including:
- bs4 (BeautifulSoup) for reading HTML files;
- Selenium;
- xlrd (for reading data from Excel files);
- pdftotext (for reading PDF files).
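
As a sketch of the most common case - getting numbers out of a spreadsheet - here is a minimal XLS parser using xlrd. The file layout assumed here (a single header row, dates in the first column, foreign holdings in the second) is illustrative, not the real layout of any particular source file.

```python
import xlrd

def parse_holdings(path):
    """Read (date, value) pairs from an .xls file, ready to load into a database.
    Assumes column 0 holds dates and column 1 holds foreign holdings,
    with one header row - an illustrative layout only."""
    book = xlrd.open_workbook(path)
    sheet = book.sheet_by_index(0)
    rows = []
    for rowx in range(1, sheet.nrows):  # skip the header row
        date_cell = sheet.cell_value(rowx, 0)
        value = sheet.cell_value(rowx, 1)
        # XLS stores dates as floats; convert to a proper date.
        date = xlrd.xldate_as_datetime(date_cell, book.datemode)
        rows.append((date.date(), float(value)))
    return rows
```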

There is a Parsers README that describes the technical details of parser development.
The complete list of parsers can be seen here in Django Admin, and you can see whether they are flagged as 'active' or not, and of what type (country, fund or '-') they are.

We have coded a web page on the App called 'Parser Management' to allow the Trounceflow Team to view the status of the parsers. When parsing is successful, success is 'true' (otherwise 'false').

Most of our parsers are not for parsing PDF files. Rather, they are for parsing webpages that have been updated (each day, each week, each month, each quarter) or for parsing spreadsheets that have been uploaded to those websites. Parsing PDF files is surprisingly difficult. We collect a lot of data from 'country' websites, e.g. the National Treasury of South Africa, and so most of our parsers are 'country' parsers.
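
Part of what makes PDF parsing difficult is that extraction gives back unstructured text rather than rows and columns. A minimal sketch with the pdftotext library (the file name is a placeholder) shows the starting point; turning that text back into a clean table of numbers is the hard, source-specific part.

```python
import pdftotext

# The file name is a placeholder for whichever PDF a country website publishes.
with open("holdings-report.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# pdftotext returns one plain-text string per page.
first_page_lines = pdf[0].splitlines()
print(first_page_lines[:10])
```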