Analysis of Disk Cache Technologies in MySQL, Redis, Kafka, and Other Components

Most of these components store data on disk, but because of the gap between CPU speed and disk speed, they all use caching to improve performance. A cache is, simply put, a region of memory: data read from disk is placed in the cache first, later queries and modifications operate on the cache directly, and cached data is flushed back to disk at some frequency. What to cache, how much to cache, and when to flush all shape the performance of the whole component. After reading about the architecture of MySQL and similar components, you will find that whether disk-based (the MySQL database, the Kafka message middleware, the ZooKeeper distributed coordination framework) or memory-based (the Redis database), they all implement well-designed data exchange between memory and disk, balancing fast reads against durable persistence. Caches also follow locality rules of space and time: spatially, data adjacent to hot data is likely to be accessed soon; temporally, hot data accessed once is likely to be accessed again.

MySQL disk cache (InnoDB engine only)

To analyze which data MySQL caches, it helps to start at the source: the buffer pool of the InnoDB engine.

InnoDB can be configured with multiple buffer pool instances to increase database concurrency, and the total buffer pool size is configurable. Each page in the buffer pool is 16KB, and the pool is managed with an LRU algorithm. When a page in the LRU list is modified, it becomes inconsistent with the data on disk and is called a dirty page; the database flushes dirty pages back to disk through the CHECKPOINT mechanism. Dirty pages also appear in the Flush list. The Flush list and the LRU list do not interfere with each other: the LRU list manages the availability of pages in the buffer pool, while the Flush list manages flushing pages back to disk. The number of dirty pages can be queried with a command.
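As a rough illustration of how an LRU list and a Flush list coexist, here is a minimal sketch; it is not InnoDB's actual implementation (which, among other things, inserts new pages at a midpoint of the LRU list), and all names in it are made up:

```python
from collections import OrderedDict

PAGE_SIZE = 16 * 1024  # InnoDB pages default to 16KB

class BufferPool:
    """Toy buffer pool: an LRU list for page residency, a Flush list for dirty pages."""

    def __init__(self, capacity_pages, disk):
        self.capacity = capacity_pages
        self.disk = disk                   # page_no -> bytes; stands in for the data files
        self.lru = OrderedDict()           # page_no -> page bytes, most recently used last
        self.flush_list = set()            # page numbers of dirty pages

    def fetch(self, page_no):
        if page_no in self.lru:
            self.lru.move_to_end(page_no)  # hit: promote within the LRU list
        else:
            if len(self.lru) >= self.capacity:
                self._evict_one()
            self.lru[page_no] = self.disk.get(page_no, b"\x00" * PAGE_SIZE)
        return self.lru[page_no]

    def modify(self, page_no, data):
        self.fetch(page_no)
        self.lru[page_no] = data
        self.flush_list.add(page_no)       # now inconsistent with disk: a dirty page

    def checkpoint(self):
        """CHECKPOINT-style flush: write every dirty page back and empty the Flush list."""
        for page_no in list(self.flush_list):
            self.disk[page_no] = self.lru[page_no]
            self.flush_list.discard(page_no)

    def _evict_one(self):
        victim, data = self.lru.popitem(last=False)  # least recently used page
        if victim in self.flush_list:                # dirty victims must be written first
            self.disk[victim] = data
            self.flush_list.discard(victim)
```

Note how the two lists answer different questions: the LRU list decides which page to evict, while the Flush list records which pages still owe a write to disk.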

The next thing to look at is the caches tied to files on disk.

Redo log buffer

The redo log buffer is specific to the InnoDB engine and corresponds to the redo log files on disk. Its default size is 8MB; since the redo log is normally flushed to the log files every second, the buffer does not need to be large. The contents of the redo log buffer are flushed to the redo log files in the following three situations:

The Master Thread flushes the redo log buffer to the redo log files once per second.
Each transaction commit flushes the redo log buffer to the redo log files (controlled by innodb_flush_log_at_trx_commit).
When less than half of the redo log buffer remains free, its contents are flushed to the redo log files.
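A highly simplified sketch of those three triggers in one place (illustrative only; the real buffer holds log records rather than a byte counter, and flushing involves the log files and fsync):

```python
import threading, time

class RedoLogBuffer:
    def __init__(self, size=8 * 1024 * 1024, flush_log_at_trx_commit=1):
        self.size = size                     # default redo log buffer size: 8MB
        self.used = 0
        self.flush_log_at_trx_commit = flush_log_at_trx_commit
        self.lock = threading.Lock()
        threading.Thread(target=self._master_thread, daemon=True).start()

    def write_record(self, nbytes):
        with self.lock:
            self.used += nbytes
            if self.used > self.size // 2:   # trigger 3: less than half the buffer free
                self._flush_locked()

    def commit(self):
        if self.flush_log_at_trx_commit == 1:
            with self.lock:                  # trigger 2: flush on transaction commit
                self._flush_locked()

    def _master_thread(self):
        while True:                          # trigger 1: Master Thread flushes every second
            time.sleep(1)
            with self.lock:
                self._flush_locked()

    def _flush_locked(self):
        self.used = 0                        # real code writes to the redo log files here
```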

Because the cache and the data on disk cannot be kept consistent in real time, transactional databases today generally adopt a Write-Ahead Log (WAL) strategy to prevent data loss: when a transaction commits, the redo log is written first, and only then is the page modified. If a crash causes data loss, the data can be recovered by replaying the log, which guarantees the durability property of transactions. For higher reliability, multiple mirrored log groups can be configured.
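A minimal write-ahead-logging sketch built on the same principle (illustrative: real redo records are physical, page-oriented records, not JSON key-value pairs):

```python
import json, os

class MiniWAL:
    """Log first, fsync, and only then modify the in-memory page."""

    def __init__(self, path):
        self.path = path
        self.pages = {}                       # stands in for cached data pages

    def update(self, key, value):
        with open(self.path, "a") as log:
            log.write(json.dumps({"k": key, "v": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())            # record is durable before the page changes
        self.pages[key] = value               # safe to modify the cached page now

    def recover(self):
        """After a crash, replay the log to rebuild the committed state."""
        self.pages.clear()
        if os.path.exists(self.path):
            with open(self.path) as log:
                for line in log:
                    rec = json.loads(line)
                    self.pages[rec["k"]] = rec["v"]
```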

Data page and index page cache

This is where a key InnoDB feature comes in: the Insert Buffer (later generalized as the Change Buffer). InnoDB gives every table a primary key, the unique identifier of a row, and rows are usually inserted in increasing primary-key order, so inserts into the clustered index generally do not require random reads. A table, however, may also have several non-clustered secondary indexes. On insert, data pages are still stored sequentially by clustered index, but updates to the secondary index pages are scattered, and such random reads degrade performance. The Insert Buffer therefore caches changes to secondary indexes and merges them into the actual secondary index pages at a certain frequency.
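A toy sketch of the buffering idea (illustrative only: real InnoDB buffers a change only when the target secondary index page is not already in the buffer pool, and keeps the Insert Buffer itself in a B+ tree):

```python
from collections import defaultdict

class InsertBufferSketch:
    def __init__(self, secondary_index):
        self.index = secondary_index       # page_no -> sorted key list, the "on-disk" index
        self.pending = defaultdict(list)   # buffered inserts grouped by target page

    def insert(self, page_no, key):
        # Buffer the change instead of randomly reading the secondary index page now.
        self.pending[page_no].append(key)
        if len(self.pending[page_no]) >= 64:   # illustrative merge threshold
            self.merge(page_no)

    def merge(self, page_no):
        # One merge applies many buffered inserts to the page in a single pass.
        page = self.index.setdefault(page_no, [])
        page.extend(self.pending.pop(page_no, []))
        page.sort()
```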

Binary log cache (binlog)

The binary log records all operations that modify the MySQL database, but not operations that leave the database unchanged, such as SELECT and SHOW. The binary log is used for database recovery, for replication in master-slave data synchronization, and for security auditing of the recorded statements.

Note that when a transactional storage engine is used, all uncommitted binary log entries are recorded in a cache, and when the transaction commits, the cached binary log is written to the binary log file. binlog_cache_size is a per-session setting rather than a global one, with a default size of 32KB.

By default, the binary log is not synchronized to disk on every write; this is adjusted with the sync_binlog variable. Its default value of 0 means that MySQL does not control binlog flushing and leaves flushing of the file system cache to the operating system, which gives the best performance.
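A sketch of the per-session cache plus a sync_binlog-style knob (names and structure are illustrative, not the server's actual code):

```python
import os

class BinlogSession:
    def __init__(self, path, sync_binlog=0):
        self.file = open(path, "ab")
        self.sync_binlog = sync_binlog   # 0: fsync left to the OS; N: fsync every N commits
        self.cache = []                  # per-session cache of uncommitted events
        self.commits = 0

    def log_event(self, event: bytes):
        self.cache.append(event)         # held in the session cache until commit

    def commit(self):
        for event in self.cache:         # the whole transaction is written contiguously
            self.file.write(event)
        self.cache.clear()
        self.file.flush()
        self.commits += 1
        if self.sync_binlog and self.commits % self.sync_binlog == 0:
            os.fsync(self.file.fileno())  # force the binlog onto disk
```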

Undo log cache

The undo log is a logical log, recorded per row. It supports transaction rollback and enables the non-locking reads of MVCC. Undo logs live in the shared tablespace, and the global dynamic parameter innodb_purge_batch_size sets how many undo pages each purge pass cleans up (default 300).

Wherever a cache is used, its contents must eventually be flushed back to disk, so which threads do the flushing? Tracing through step by step, MySQL runs four main kinds of background threads:

Master Thread: mainly responsible for asynchronously flushing buffer pool data to disk, including flushing pages, merging the insert buffer, and recycling undo pages.

IO Thread: mainly responsible for the callback handling of IO requests (InnoDB makes heavy use of AIO to improve processing performance); there are write, read, insert buffer, and log IO threads.

Purge Thread: after a transaction commits, its undo log may no longer be needed; this thread recycles those undo pages.

Page Cleaner Thread: flushes dirty pages.
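As a sketch of how the flushing duty can be pushed into the background (purely illustrative, reusing the toy BufferPool from the earlier sketch):

```python
import threading, time

def start_page_cleaner(buffer_pool, interval=1.0):
    """Periodically flush dirty pages so foreground threads rarely block on IO."""
    def loop():
        while True:
            time.sleep(interval)
            buffer_pool.checkpoint()   # write dirty pages back, emptying the Flush list
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```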

That covers how MySQL's caches and disk files exchange data, chiefly the synchronization of the data pages and the logs described above.

Redis disk cache

Strictly speaking, Redis differs from the other components: it is memory-native, and writing data to disk is what has to be configured; persistence is not mandatory. Redis mainly serves to cache data, so durable storage should be left to a backing database, and the typical access pattern is to query Redis first and fall back to the database on a miss. Relying too heavily on Redis persistence may lead to inconsistent results being returned.

Redis has two persistence mechanisms: snapshots (RDB) and the AOF log. A snapshot is a one-off full backup, while the AOF log is a continuous incremental backup, somewhat like ZooKeeper, discussed later. A snapshot is a binary serialization of the in-memory data, whereas the AOF log is a text record of the commands that modified the in-memory data; the AOF log keeps growing as Redis runs, so it is rewritten (compacted) periodically. Snapshot frequency is configured with rules of the form "save <seconds> <changes>": the first value is a time span in seconds and the second is a minimum number of write operations; when at least that many writes occur within that time span, a snapshot is saved automatically. Multiple such conditions can be set.
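A sketch of how such save conditions could be evaluated (illustrative; Redis checks its rules inside a periodic server cron job):

```python
import time

class SaveRules:
    """Evaluate snapshot conditions of the form 'save <seconds> <changes>'."""

    def __init__(self, rules):
        self.rules = rules            # e.g. [(900, 1), (300, 10), (60, 10000)]
        self.dirty = 0                # write operations since the last snapshot
        self.last_save = time.time()

    def on_write(self):
        self.dirty += 1

    def should_snapshot(self):
        elapsed = time.time() - self.last_save
        return any(elapsed >= secs and self.dirty >= changes
                   for secs, changes in self.rules)

    def snapshot_done(self):
        self.dirty = 0
        self.last_save = time.time()
```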

AOF log

When Redis receives a client command, it validates it, stores it in the AOF log, and then executes it, so that after a crash the pre-crash state can be restored by replaying the commands in the AOF log. Writing to the AOF log actually writes into a kernel memory buffer associated with the file descriptor, and the kernel flushes the dirty data back to disk asynchronously. Linux provides fsync to force a file's cached data onto disk, but if Redis called fsync after every write, the disk IO would seriously hurt Redis's performance. Normally Redis runs fsync once per second; the period is configurable, fsync can be disabled entirely (leaving scheduling to the operating system), or it can be run after every command.
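A minimal sketch of the three fsync policies (illustrative names; real Redis drives this from its event loop and offloads the per-second fsync to a background thread):

```python
import os, threading, time

class AOFWriter:
    def __init__(self, path, fsync_policy="everysec"):
        # Policies: "always" = fsync after every command, "everysec" = fsync once
        # per second in the background, "no" = leave flushing to the kernel.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.policy = fsync_policy
        if self.policy == "everysec":
            threading.Thread(target=self._syncer, daemon=True).start()

    def append(self, command: str):
        os.write(self.fd, (command + "\n").encode())  # lands in the kernel page cache
        if self.policy == "always":
            os.fsync(self.fd)

    def _syncer(self):
        while True:
            time.sleep(1)
            os.fsync(self.fd)  # amortize the fsync cost over one second of writes
```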

Kafka disk cache

Kafka makes heavy use of the page cache, one of the key factors behind its high throughput. Anyone who has used Java knows two facts:

Objects carry a large memory overhead, often several times the size of the actual data or more, so space utilization is poor. Java garbage collection gets slower and slower as the amount of heap data grows.

Given these factors, using the file system and relying on the page cache is clearly better than maintaining an in-process cache or similar structure: at minimum it saves a duplicate in-process copy of the cached data, and compact byte records can replace objects to save even more space. This way, 28GB to 30GB of memory on a 32GB machine can be used without worrying about GC-induced performance problems. Moreover, the page cache stays warm across a Kafka service restart, whereas an in-process cache would have to be rebuilt. It also greatly simplifies the code: keeping the page cache consistent with the files is the operating system's job, which is safer and more effective than maintaining consistency in-process.

From another angle, Kafka is itself a kind of database: producers insert data and consumers select it. The only component that interacts with the disk cache is the broker. The broker writes produced data straight into the page cache, and when data is consumed, it is sent from the cache to the socket with zero-copy; the disk is read only when the cache does not hold the requested data. Most operations in Kafka's usage scenarios are sequential reads and writes, and messages are written by appending to files, so performance stays high even though disks are involved.
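On Linux the zero-copy path corresponds to the sendfile system call, exposed in Python as os.sendfile; a minimal sketch of shipping part of a log segment to a consumer socket (file layout and names are illustrative):

```python
import os, socket

def serve_segment(conn: socket.socket, segment_path: str, offset: int, count: int):
    """Send `count` bytes of a segment file to a consumer without copying
    them through user space: page cache -> socket via sendfile."""
    fd = os.open(segment_path, os.O_RDONLY)
    try:
        sent = 0
        while sent < count:
            n = os.sendfile(conn.fileno(), fd, offset + sent, count - sent)
            if n == 0:          # reached end of file
                break
            sent += n
        return sent
    finally:
        os.close(fd)
```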

Kafka splits each partition's large file in a topic into multiple smaller segment files, with index files supporting data lookup. Kafka's index files are built as sparse indexes, in two kinds: offset indexes and timestamp indexes. Sparse indexing keeps the in-memory footprint of the indexes low.
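A sketch of a sparse offset index lookup (illustrative; Kafka actually stores relative offsets and file positions in a memory-mapped binary format and adds roughly one entry per log.index.interval.bytes of data):

```python
import bisect

class SparseOffsetIndex:
    """Index only some messages: find the closest indexed entry at or before
    the target offset, then scan the segment file forward from there."""

    def __init__(self):
        self.offsets = []     # sorted message offsets that have index entries
        self.positions = []   # byte position in the segment file for each entry

    def maybe_add(self, offset, position, interval=4096):
        # Sparse rule (illustrative): one entry per `interval` bytes of log data.
        if not self.positions or position - self.positions[-1] >= interval:
            self.offsets.append(offset)
            self.positions.append(position)

    def lookup(self, target_offset):
        i = bisect.bisect_right(self.offsets, target_offset) - 1
        return self.positions[i] if i >= 0 else 0  # scan from here to the target
```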

Kafka only takes responsibility for writing messages into the system page cache and makes no guarantee about when dirty data gets flushed to disk; that behavior can be tuned with parameters such as log.flush.interval.messages and log.flush.interval.ms. Kafka's message reliability rests on its multi-replica mechanism, not on synchronous flushing, which would severely hurt performance.

ZooKeeper disk cache

ZooKeeper maintains in memory a node data model resembling a tree-structured file system, containing the contents of the whole tree: all node paths, node data, and so on. In the code this is held in a DataTree structure, backed by a ConcurrentHashMap of key-value pairs. Since the data lives in memory, it must have a persistent counterpart on disk; as with Redis, ZooKeeper's persistence splits into transaction logs and snapshot data.
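A rough sketch of that in-memory shape (illustrative; the real DataTree maps paths to DataNode objects and relies on ConcurrentHashMap for thread safety):

```python
import threading

class MiniDataTree:
    """Flat path -> node map, mirroring the shape of ZooKeeper's in-memory model."""

    def __init__(self):
        self.lock = threading.Lock()   # crude stand-in for ConcurrentHashMap semantics
        self.nodes = {"/": {"data": b"", "children": set()}}

    def create(self, path: str, data: bytes):
        parent = path.rsplit("/", 1)[0] or "/"
        with self.lock:
            self.nodes[parent]["children"].add(path)
            self.nodes[path] = {"data": data, "children": set()}

    def get_data(self, path: str) -> bytes:
        with self.lock:
            return self.nodes[path]["data"]
```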

Transaction log

Transaction logs are stored under the path configured by dataLogDir (by default under dataDir) and are named after the ZXID of the first transaction record in the log. Each transaction log file is 64MB, because ZooKeeper preallocates the file's disk space. ZooKeeper writes every client transaction operation into the transaction log, so transaction log write performance directly determines how fast the ZooKeeper server responds to transaction requests. Continually appending to a file forces the underlying disk IO to allocate new disk blocks for it, so the space is allocated in advance to reduce the frequency of disk seeks and improve disk IO efficiency. Once a transaction operation has been written into the file stream's buffer, the buffered data must be forced onto disk; this is configured with the forceSync parameter: forceSync=yes flushes and syncs to disk on every transaction commit, while forceSync=no lets the system schedule the flushing.
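A sketch of preallocation plus the forceSync switch (illustrative: real ZooKeeper pads files in 64MB blocks, names them by ZXID, and frames records with checksums):

```python
import os

class TxnLogSketch:
    PREALLOC = 64 * 1024 * 1024          # reserve space up front to avoid per-append seeks

    def __init__(self, path, force_sync=True):
        self.f = open(path, "wb")
        self.f.truncate(self.PREALLOC)   # preallocate the file's disk space
        self.f.seek(0)
        self.force_sync = force_sync     # models forceSync=yes / forceSync=no

    def append(self, record: bytes):
        self.f.write(len(record).to_bytes(4, "big") + record)
        self.f.flush()                   # stream buffer -> kernel page cache
        if self.force_sync:
            os.fsync(self.f.fileno())    # forceSync=yes: durable before acknowledging
```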

ZooKeeper's update path: write the transaction log first, then update memory, and periodically persist to disk (dumping memory to a snapshot file). The transaction log has a large impact on write-request performance, so put snapshot files and transaction log files on different disks and make sure the disk holding dataLogDir performs well and has no competing IO.
