Must-read for quants: What exactly is tick data, and why is reliable trade data so hard to find?

Author: The Little Dream, Created: 2016-11-02 19:33:56, Updated: 2016-11-02 19:48:20

  • One: What is tick data?

Tick data itself is no mystery: the exchange sends you the active order book of each stock (or future, or option), i.e. the orders that are still resting on the exchange and have not yet been matched.

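**Example:**
  Suppose that at the start of the trading day the order book (the resting orders) for Apple stock is empty (the auction period is not discussed here):
  1. The first seller arrives: 1000@100:
  The exchange sends you a message saying that someone wants to sell 1,000 shares of Apple at 100.
  The order rests on the order book and becomes the best ask (ask 1).

  Ask: 1000@100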


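  2. A second seller arrives and wants a higher price: 1000@101:
  The exchange sends you another message saying that someone is offering Apple at a worse (higher) price than the existing ask, so the order ranks further up the book as ask 2.

  Ask: 1000@101

  1000@100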


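  3. The first seller changes his mind and cancels his order: 1000@100 is withdrawn, and the exchange sends a message telling you so.
  Only 1000@101 remains (now ask 1). You may need to write your own code to handle removing a tick like this.

  Ask: 1000@101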


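  4. At last a buyer arrives: 500@90. This will not trade, because the bid is below the current best ask of 101.
  The order therefore rests in the order book, and a tick is sent to tell everyone else in the market that there is now a bid:

  Ask: 1000@101

  Bid: 500@90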


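  5. Next, another buyer bids for 1,000 shares at 101, which matches the current best offer of 1000@101. You will not receive this new bid of 1000@101,
  because the instant it enters the matching engine it is matched against the best offer on the other side. One rule of the tick table: the bid and the offer never cross;
  if they do, it is a bug either at the data vendor or at the exchange. All you receive is a message telling you to delete the best offer, so the tick table now looks like this:

  Bid: 500@90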
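
The five steps above can be replayed in a few lines of Python. This is only a minimal sketch: the message layout (`type`, `side`, `id`, `qty`, `price`) and the `OrderBook` class are assumptions made for illustration, not the format of any real exchange feed.

```python
from dataclasses import dataclass, field


@dataclass
class OrderBook:
    """Toy book keyed by hypothetical order ids; real feeds differ."""
    bids: dict = field(default_factory=dict)  # order id -> (qty, price)
    asks: dict = field(default_factory=dict)  # order id -> (qty, price)

    def apply(self, msg: dict) -> None:
        side = self.bids if msg["side"] == "buy" else self.asks
        if msg["type"] == "add":
            side[msg["id"]] = (msg["qty"], msg["price"])
        elif msg["type"] == "delete":  # sent for a cancel or a full match
            side.pop(msg["id"], None)
        else:
            raise ValueError(f"unknown message type: {msg['type']}")

    def best_bid(self):
        # Highest-priced resting buy order, or None if there are no bids.
        return max(self.bids.values(), key=lambda o: o[1], default=None)

    def best_ask(self):
        # Lowest-priced resting sell order, or None if there are no asks.
        return min(self.asks.values(), key=lambda o: o[1], default=None)

    def is_crossed(self) -> bool:
        # True if the best bid meets or exceeds the best ask.
        bid, ask = self.best_bid(), self.best_ask()
        return bid is not None and ask is not None and bid[1] >= ask[1]


messages = [
    {"type": "add", "side": "sell", "id": 1, "qty": 1000, "price": 100},  # step 1
    {"type": "add", "side": "sell", "id": 2, "qty": 1000, "price": 101},  # step 2
    {"type": "delete", "side": "sell", "id": 1},                          # step 3
    {"type": "add", "side": "buy", "id": 3, "qty": 500, "price": 90},     # step 4
    {"type": "delete", "side": "sell", "id": 2},                          # step 5
]

book = OrderBook()
for msg in messages:
    book.apply(msg)
    assert not book.is_crossed()  # a crossed book would signal a bug

print(book.best_bid(), book.best_ask())  # (500, 90) None
```

Checking `is_crossed()` after every message is a cheap way to notice early that a tick was applied incorrectly.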

Tick data is as simple as that, and the market just keeps repeating this process. The real headaches are:

- 1. Tick data is often sent over UDP. When the market is very active the data volume is huge and UDP packets get dropped. How do you handle that?

- 2. How do you process real-time tick data fast enough? The volume is so large that once you fall behind, you may never catch up with the real-time tick rate again, until your program hangs.

- 3. How do you avoid special cases that trigger bugs? Once a single tick is applied incorrectly, every tick table after it is wrong :)
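
For the first and third problems, one common approach is to watch for gaps in a per-packet sequence number. The sketch below assumes that every packet carries such a number, starting at 1; the article does not say this, and the `GapDetector` class is hypothetical.

```python
class GapDetector:
    """Tracks the next expected sequence number on one feed channel."""

    def __init__(self, first_seq=1):
        self.expected = first_seq
        self.gaps = []  # inclusive (start, end) ranges that never arrived

    def on_packet(self, seq):
        """Return True if the packet should be applied to the book."""
        if seq < self.expected:
            return False  # duplicate or late packet: ignore it
        if seq > self.expected:
            # Packets expected..seq-1 were lost: the book is now suspect.
            self.gaps.append((self.expected, seq - 1))
        self.expected = seq + 1
        return True


detector = GapDetector()
applied = [seq for seq in [1, 2, 3, 6, 5, 7] if detector.on_packet(seq)]
print(applied)        # [1, 2, 3, 6, 7]
print(detector.gaps)  # [(4, 5)]
```

Once a gap is recorded, the local tick table can no longer be trusted until it is rebuilt from some recovery source. How that recovery works depends on the feed and is not covered here.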

Also, there is a catch in what "tick" means: different markets publish different ticks. What is described above is how stock markets in developed countries work, where updates are pushed in real time whenever a new order falls within the depth the exchange publishes. For example, the Tokyo exchange sends only 8 levels, so you do not see the full book, which may have more than 100 levels when many people are trading. Domestic exchanges instead take a snapshot at a fixed interval (for example, one snapshot covering 3 seconds of ticks) and send you that, most likely because the domestic trading systems are old and have not kept up with IT developments. Such tick data is not real-time tick data; just be aware of that!
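
To make the depth limit concrete, here is a small sketch of publishing only the best 8 levels of a deeper book. The price-to-quantity dictionary and the `visible_levels` helper are assumptions made for illustration.

```python
def visible_levels(levels, depth, side):
    """Return the `depth` best price levels of one side of the book.

    levels: dict mapping price -> total resting quantity (assumed layout)
    side:   "bid" (highest price first) or "ask" (lowest price first)
    """
    ordered = sorted(levels.items(), reverse=(side == "bid"))
    return ordered[:depth]


# A made-up ask side with 12 price levels, 100 shares resting at each.
full_asks = {101 + i: 100 for i in range(12)}

published = visible_levels(full_asks, depth=8, side="ask")
hidden = len(full_asks) - len(published)
print(published[0], published[-1], hidden)  # (101, 100) (108, 100) 4
```

Orders resting at the four hidden levels never appear in the feed until the book moves toward them.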

(This article was compiled by the quantitative trader WeChat id: quantcity.)

  • Two: What are some details of snapshot data and exchange data?

Foreign high-frequency tick data contains the complete order flow, so you can reconstruct snapshot data from the order data.

The country's two major stock exchanges and four major futures exchanges in theory all publish snapshot data. Typical data fields include:

- Opening price
- Highest price
- Lowest price
- Most recent price
- Volume
- Turnover

The highest (lowest) price here is the highest (lowest) traded price from the open until now. If you have the details of every trade, it can be derived with max (min), which is why foreign tick data generally does not carry this field.

There are three types of real-time data offered by exchanges and central banks: snapshots, trade-by-trade data, and order-by-order data.

A snapshot is a photograph of the market taken every 3 seconds; the current price, high, low, volume, turnover, etc. are then sent out. Because the photograph is taken only every 3 seconds, we do not know what happens in the market within each 3-second window. Continuous bidding runs for about 4 hours a day: 2 hours in the morning and 2 hours in the afternoon.

Trade-by-trade data records every individual trade, the atomic unit. However, this data is also sent in 3-second batches rather than in real time. For example, a trade that occurs at the 1.5-second mark is only sent at the 3-second mark.

In Level 2 data, only the first 50 orders on the buy and sell sides are listed, not all of them.
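
The relationship between individual trades and 3-second snapshots can be sketched as follows. The trade tuples, the field names, and the `build_snapshots` function are assumptions for illustration; real exchange snapshots carry more fields, such as the quote levels.

```python
def build_snapshots(trades, interval=3.0):
    """Aggregate trades into cumulative snapshots, one per `interval` seconds.

    trades: list of (seconds_since_open, price, quantity), sorted by time.
    All fields are cumulative since the open.
    """
    snapshots = []
    state = None
    boundary = interval
    for t, price, qty in trades:
        # Emit a snapshot for every boundary passed before this trade.
        while t >= boundary:
            if state is not None:
                snapshots.append({"time": boundary, **state})
            boundary += interval
        if state is None:
            state = {"open": price, "high": price, "low": price,
                     "last": price, "volume": 0, "turnover": 0.0}
        state["high"] = max(state["high"], price)
        state["low"] = min(state["low"], price)
        state["last"] = price
        state["volume"] += qty
        state["turnover"] += price * qty
    if state is not None:
        snapshots.append({"time": boundary, **state})
    return snapshots


trades = [(0.5, 100.0, 200), (1.5, 101.0, 100), (2.9, 99.5, 300), (4.0, 100.5, 100)]
snaps = build_snapshots(trades)
print(snaps[0])
# {'time': 3.0, 'open': 100.0, 'high': 101.0, 'low': 99.5,
#  'last': 99.5, 'volume': 600, 'turnover': 59950.0}
```

The trade at 1.5 seconds only shows up in the snapshot stamped 3.0, and only through the high and the cumulative totals. The order in which the three trades happened cannot be recovered from the snapshot.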

**Typically, several kinds of causes lead to discrepancies in the data**
- **1. How the data is recorded**

For example, suppose a stock exchange publishes a DBF file that holds the latest state of every security, and the file is refreshed automatically. The data vendor, or whoever records the data, has to read the file every few seconds and write everything into a database. Because the exchange does not refresh the file at a fixed rate, the best way to avoid missing updates is to read it more often than it is updated. This mechanism explains why inactive securities have fewer records than active ones, why far-month futures contracts have fewer records than near-month ones, why timestamps are inconsistent, and so on.
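
A recorder built this way might store a row only when a security's record has changed since the previous read. The sketch below uses an in-memory stand-in for the exchange file; the `(last price, cumulative volume)` record and the `record_changes` function are hypothetical.

```python
def record_changes(reads):
    """Keep only the reads in which a symbol's record actually changed.

    reads: iterable of (read_time, {symbol: record}) pairs, one per poll
           of the exchange file (an in-memory stand-in for the DBF).
    """
    last_seen = {}
    stored = []
    for read_time, table in reads:
        for symbol, record in table.items():
            if last_seen.get(symbol) != record:
                stored.append((read_time, symbol, record))
                last_seen[symbol] = record
    return stored


# Polling once per second; each record is (last price, cumulative volume).
reads = [
    (1, {"ACTIVE": (10.00, 500), "QUIET": (5.00, 10)}),
    (2, {"ACTIVE": (10.01, 700), "QUIET": (5.00, 10)}),
    (3, {"ACTIVE": (10.01, 700), "QUIET": (5.00, 10)}),
    (4, {"ACTIVE": (10.02, 900), "QUIET": (5.01, 20)}),
]
rows = record_changes(reads)
counts = {}
for _, symbol, _ in rows:
    counts[symbol] = counts.get(symbol, 0) + 1
print(counts)  # {'ACTIVE': 3, 'QUIET': 2}
```

The quiet symbol ends up with fewer rows than the active one, and each row is stamped with the time of the read rather than the time of the underlying change.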

- **2. Operations issues**

No one can guarantee that the network never goes down. If there is a disconnection, a machine fault, a program error, etc., you miss what the exchange broadcast in the meantime. Under the mechanism described above, Level 1 data at moment T and at moment T + 1 have no logical connection, so a gap cannot be detected from the data itself. A large share of missing data is in fact caused by these problems, and it cannot be recovered!

- **3. Data errors caused by programs**

Some of the stranger errors, such as abnormal or empty prices for certain kinds of stocks, may be caused by bugs in the recording program. It is therefore difficult in principle to have 100% reliable data. Checking and cleaning the data is necessary, and it is also tedious work; which rules to set up depends on personal experience.
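
As an illustration of what such cleaning rules can look like, here is a small sketch that checks one snapshot against the previous one. The rules and the field names are examples chosen for this sketch, not a complete or authoritative rule set.

```python
def check_snapshot(prev, cur):
    """Return a list of rule violations for `cur` given the previous snapshot.

    Snapshots are dicts with keys time, high, low, last, volume
    (an assumed layout; volume is cumulative since the open).
    """
    problems = []
    if cur["last"] is None or cur["last"] <= 0:
        problems.append("missing or non-positive last price")
    elif not cur["low"] <= cur["last"] <= cur["high"]:
        problems.append("last price outside [low, high]")
    if prev is not None:
        if cur["time"] <= prev["time"]:
            problems.append("timestamp did not advance")
        if cur["volume"] < prev["volume"]:
            problems.append("cumulative volume decreased")
    return problems


good = {"time": 3, "high": 101.0, "low": 99.5, "last": 99.5, "volume": 600}
bad = {"time": 6, "high": 101.0, "low": 99.5, "last": 0.0, "volume": 500}

print(check_snapshot(None, good))  # []
print(check_snapshot(good, bad))
# ['missing or non-positive last price', 'cumulative volume decreased']
```

Rules like these only flag records that contradict themselves or their predecessor. They cannot tell you that a whole stretch of data is missing.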

