June 5, 2013


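In short, Redis can persist the key-value data in its database to disk, although a short window of recent writes may be lost. The official introduction to persistence is the "Redis Persistence" page; this article looks at how persistence is implemented at the code level.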


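  1. The first is RDB snapshot persistence. RDB persistence takes a snapshot of the database contents and stores it on disk, so snapshots have to be taken periodically. This approach cannot persist data in real time: after a failure, only the state as of the last snapshot can be recovered, which makes it fairly limited. Redis master-slave synchronization is also built on RDB, which a later article will analyze;
  2. The second is AOF log persistence, which works in near real time. AOF stands for Append Only File, a file that is only ever appended to. In this mode Redis first takes a snapshot of the database, converts the data back into text in the client protocol format, writes it to a temporary file, and then replaces the regular AOF file with it, appending the commands that arrived during this process to the end of the AOF file. From then on, every subsequent client command is written to disk according to the configured safety level. This gives real-time persistence with only a short window of possible data loss, which most systems can tolerate.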



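When Redis starts with AOF enabled, that is, with "appendonly yes" configured, it loads the database from the file named by the "appendfilename" directive to initialize itself. The call flow is: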

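main()->loadDataFromDisk(). The latter checks whether server.aof_state == REDIS_AOF_ON and, if so, calls loadAppendOnlyFile to load the data file. Loading reads the file contents and runs each command through cmd->proc(fakeClient), exactly as if it were a client request; in other words, it replays the operations.
If "appendonly yes" is not configured, Redis loads the data from the RDB file instead.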
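To make the replay idea concrete, here is a minimal sketch of reading commands back out of AOF text. It is not the code of loadAppendOnlyFile: the names parse_command, replay_aof, MAX_ARGS and MAX_ARG_LEN are hypothetical, the input is assumed to be a NUL-terminated string already in memory, and arguments are assumed to be short and free of NUL bytes (the real format is binary-safe).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ARGS 16      /* hypothetical limit for this sketch */
#define MAX_ARG_LEN 64   /* hypothetical limit for this sketch */

/* Parse one multi-bulk command ("*<argc>\r\n$<len>\r\n<arg>\r\n...") at p.
 * Returns a pointer just past the command, or NULL on malformed input. */
const char *parse_command(const char *p, char argv[MAX_ARGS][MAX_ARG_LEN],
                          int *argc) {
    char *end;
    assert(p != NULL && argc != NULL);
    if (*p != '*') return NULL;
    long n = strtol(p + 1, &end, 10);
    if (n < 1 || n > MAX_ARGS || end[0] != '\r' || end[1] != '\n') return NULL;
    p = end + 2;
    for (long i = 0; i < n; i++) {
        if (*p != '$') return NULL;
        long len = strtol(p + 1, &end, 10);
        if (len < 0 || len >= MAX_ARG_LEN || end[0] != '\r' || end[1] != '\n')
            return NULL;
        p = end + 2;
        if (strlen(p) < (size_t)len + 2) return NULL;   /* truncated file */
        if (p[len] != '\r' || p[len + 1] != '\n') return NULL;
        memcpy(argv[i], p, (size_t)len);
        argv[i][len] = '\0';
        p += len + 2;
    }
    *argc = (int)n;
    return p;
}

/* Callback standing in for cmd->proc(fakeClient). */
typedef void (*command_fn)(int argc, char argv[MAX_ARGS][MAX_ARG_LEN],
                           void *ctx);

/* Replay every command in the text; returns the number of commands
 * executed, or -1 if the text is malformed. fn may be NULL. */
int replay_aof(const char *aof, command_fn fn, void *ctx) {
    char argv[MAX_ARGS][MAX_ARG_LEN];
    int argc = 0, count = 0;
    while (*aof != '\0') {
        aof = parse_command(aof, argv, &argc);
        if (aof == NULL) return -1;
        if (fn != NULL) fn(argc, argv, ctx);
        count++;
    }
    return count;
}
```

Passing a callback that looks up the command by argv[0] and executes it would turn this into a toy loader; the point is only that loading an AOF is parsing followed by ordinary command execution.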

/* Function called at startup to load RDB or AOF file in memory. */
void loadDataFromDisk(void) {
    long long start = ustime();
    if (server.aof_state == REDIS_AOF_ON) {
        if (loadAppendOnlyFile(server.aof_filename) == REDIS_OK)
            redisLog(REDIS_NOTICE,"DB loaded from append only file: %.3f seconds",(float)(ustime()-start)/1000000);
    } else {
        if (rdbLoad(server.rdb_filename) == REDIS_OK) {
            redisLog(REDIS_NOTICE,"DB loaded from disk: %.3f seconds",
                (float)(ustime()-start)/1000000);
        } else if (errno != ENOENT) {
            redisLog(REDIS_WARNING,"Fatal error loading the DB: %s. Exiting.",strerror(errno));
            exit(1);
        }
    }
}



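void aeMain(aeEventLoop *eventLoop) {
    eventLoop->stop = 0;
    while (!eventLoop->stop) { // stop == 1 ends the loop and shuts the service down
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);
        aeProcessEvents(eventLoop, AE_ALL_EVENTS); // process all kinds of events
    }
}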


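/* Call() is the core of Redis execution of a command */
void call(redisClient *c, int flags) {
    long long dirty, start = ustime(), duration;

    /* ... */
    /* Call the command. */
    dirty = server.dirty;
    c->cmd->proc(c);
    dirty = server.dirty-dirty;
    duration = ustime()-start;
    /* ... */
    /* Propagate the command into the AOF and replication link */
    if (flags & REDIS_CALL_PROPAGATE) {
        int flags = REDIS_PROPAGATE_NONE;

        if (c->cmd->flags & REDIS_CMD_FORCE_REPLICATION)
            flags |= REDIS_PROPAGATE_REPL;
        if (dirty) // the command modified data
            flags |= (REDIS_PROPAGATE_REPL | REDIS_PROPAGATE_AOF);
        if (flags != REDIS_PROPAGATE_NONE)
            propagate(c->cmd,c->db->id,c->argv,c->argc,flags);
    }
    /* ... */
}
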
/* Propagate the specified command (in the context of the specified database id)
 * to AOF and Slaves.
 *
 * flags are an xor between:
 * + REDIS_PROPAGATE_NONE (no propagation of command at all)
 * + REDIS_PROPAGATE_AOF (propagate into the AOF file if is enabled)
 * + REDIS_PROPAGATE_REPL (propagate into the replication link)
 */
void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
               int flags)
{
    if (server.aof_state != REDIS_AOF_OFF && flags & REDIS_PROPAGATE_AOF)
        feedAppendOnlyFile(cmd,dbid,argv,argc);
    if (flags & REDIS_PROPAGATE_REPL && listLength(server.slaves))
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}

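call() invokes propagate(), which, when server.aof_state != REDIS_AOF_OFF and the REDIS_PROPAGATE_AOF flag is set, prepares the AOF data by calling feedAppendOnlyFile. That function converts the arguments passed by the client into protocol text and stores it in the AOF buffer.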
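As an illustration of what "converting the arguments into protocol text" means, the following sketch encodes an argument vector in the same multi-bulk format that the SELECT line in feedAppendOnlyFile uses. The function encode_command is a hypothetical stand-in, not the Redis helper; it works on plain C strings and so, unlike the real code, is not binary-safe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode argc C strings as "*<argc>\r\n" followed by one
 * "$<len>\r\n<arg>\r\n" entry per argument.
 * Returns the number of bytes written (excluding the trailing NUL),
 * or -1 if the output buffer is too small. */
int encode_command(char *out, size_t cap, int argc, const char **argv) {
    assert(out != NULL && argv != NULL && argc > 0);
    int n = snprintf(out, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    size_t used = (size_t)n;
    for (int i = 0; i < argc; i++) {
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

For example, encoding the three arguments SET, key and value yields the text `*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n`, which is exactly what a client would have sent over the wire.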


void feedAppendOnlyFile(struct redisCommand *cmd, int dictid, robj **argv, int argc) {
    // Turn the command back into its string (protocol) form, then append it to the server.aof_buf string.
    sds buf = sdsempty();
    robj *tmpargv[3];

    /* The DB this command was targeting is not the same as the last command
     * we appendend. To issue a SELECT command is needed. */
    if (dictid != server.aof_selected_db) {
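        // The currently selected DB is not the target DB, so emit a SELECT <db> command first.
        char seldb[64];

        snprintf(seldb,sizeof(seldb),"%d",dictid);
        buf = sdscatprintf(buf,"*2\r\n$6\r\nSELECT\r\n$%lu\r\n%s\r\n",
            (unsigned long)strlen(seldb),seldb);
        server.aof_selected_db = dictid; // remember the DB now selected in the AOF
    }
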
    if (cmd->proc == expireCommand || cmd->proc == pexpireCommand ||
        cmd->proc == expireatCommand) {
        /* Translate EXPIRE/PEXPIRE/EXPIREAT into PEXPIREAT */
        buf = catAppendOnlyExpireAtCommand(buf,cmd,argv[1],argv[2]);
    } else if (cmd->proc == setexCommand || cmd->proc == psetexCommand) {
        /* Translate SETEX/PSETEX to SET and PEXPIREAT */
        tmpargv[0] = createStringObject("SET",3);
        tmpargv[1] = argv[1];
        tmpargv[2] = argv[3];
        buf = catAppendOnlyGenericCommand(buf,3,tmpargv);
        buf = catAppendOnlyExpireAtCommand(buf,cmd,argv[1],argv[2]);
    } else {
        /* All the other commands don't need translation or need the
         * same translation already operated in the command vector
         * for the replication itself. */
        buf = catAppendOnlyGenericCommand(buf,argc,argv);
    }

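Once the buffer for this client request is ready, it has to be saved. Two destinations matter here: aof_buf, which caches the live incremental data, and aof_rewrite_buf_blocks, which caches the diff accumulated while a snapshot is being saved.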


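What is aof_buf for? Whenever server.aof_state == REDIS_AOF_ON, meaning AOF is in its normal enabled state (neither turned off nor in the middle of a snapshot), the client command is appended to the end of aof_buf, as follows: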

    /* Append to the AOF buffer. This will be flushed on disk just before
     * of re-entering the event loop, so before the client will get a
     * positive reply about the operation performed. */
    if (server.aof_state == REDIS_AOF_ON)
        server.aof_buf = sdscatlen(server.aof_buf,buf,sdslen(buf));
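The routing between the two buffers can be sketched as follows. This is a simplified model, not the Redis code: struct aof_sketch, feed_command and the fixed-size character arrays are hypothetical, and the rule that the diff buffer is fed whenever a rewrite child is running is an assumption about the part of feedAppendOnlyFile not shown in this article.

```c
#include <assert.h>
#include <string.h>

enum aof_state { AOF_OFF, AOF_ON, AOF_WAIT_REWRITE };

struct aof_sketch {
    enum aof_state state;        /* plays the role of server.aof_state */
    int rewrite_child_running;   /* assumption: set while a rewrite is in progress */
    char aof_buf[256];           /* live incremental data, flushed to the AOF */
    char rewrite_buf[256];       /* diff accumulated during a rewrite */
};

/* Append src to the NUL-terminated buffer dst of capacity cap. */
static void append_text(char *dst, size_t cap, const char *src) {
    size_t used = strlen(dst), n = strlen(src);
    assert(used + n < cap);      /* sketch only: no growth strategy */
    memcpy(dst + used, src, n + 1);
}

/* Route one already-encoded command to the buffers that need it. */
void feed_command(struct aof_sketch *s, const char *cmd) {
    if (s->state == AOF_ON)
        append_text(s->aof_buf, sizeof(s->aof_buf), cmd);
    if (s->rewrite_child_running)
        append_text(s->rewrite_buf, sizeof(s->rewrite_buf), cmd);
}
```

The two conditions are independent: during a rewrite of an already enabled AOF the same command lands in both buffers, so the old file stays complete while the new one can later catch up from the diff.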



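To offer different safety levels, Redis can fsync at most once per second, fsync on every write, or never fsync on its own initiative. Because fsync reduces performance, the right choice depends on the application. The policy is set with the appendfsync no/everysec/always directive; in the code it is represented by the following three macros and stored in server.aof_fsync.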

#define AOF_FSYNC_NO 0
#define AOF_FSYNC_ALWAYS 1
#define AOF_FSYNC_EVERYSEC 2
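How the three policies translate into a decision after each write can be sketched as a small pure function. The enum and the name should_fsync_now are hypothetical; the everysec rule mirrors the server.unixtime > server.aof_last_fsync comparison in flushAppendOnlyFile, with times in whole seconds.

```c
#include <assert.h>
#include <time.h>

enum fsync_policy { POLICY_NO, POLICY_ALWAYS, POLICY_EVERYSEC };

/* Returns 1 if an fsync should be issued after the current write. */
int should_fsync_now(enum fsync_policy policy, time_t now, time_t last_fsync) {
    assert(now >= last_fsync);
    switch (policy) {
    case POLICY_ALWAYS:
        return 1;                    /* sync after every write */
    case POLICY_EVERYSEC:
        return now > last_fsync;     /* at most once per wall-clock second */
    case POLICY_NO:
    default:
        return 0;                    /* leave flushing to the operating system */
    }
}
```

With second-granularity timestamps, several writes inside the same second trigger at most one fsync under everysec, which is where the performance gain over always comes from.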




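From this we can see that, under appendfsync everysec, Redis bounds data loss to at most 2 seconds. The author's own explanation of AOF in "Redis persistence demystified" says the same: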

appendfsync everysec

In this configuration data will be both written to the file using write(2) and flushed from the kernel to the disk using fsync(2) one time every second. Usually the write(2) call will actually be performed every time we return to the event loop, but this is not guaranteed.

However if the disk can't cope with the write speed, and the background fsync(2) call is taking longer than 1 second, Redis may delay the write up to an additional second (in order to avoid that the write will block the main thread because of an fsync(2) running in the background thread against the same file descriptor). If a total of two seconds elapsed without that fsync(2) was able to terminate, Redis finally performs a (likely blocking) write(2) to transfer data to the disk at any cost.

So in this mode Redis guarantees that, in the worst case, within 2 seconds everything you write is going to be committed to the operating system buffers and transfered to the disk. In the average case data will be committed every second.
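The postponement rule described in the quote can be modeled in isolation. The sketch below assumes the everysec policy is active; decide_flush and enum flush_action are hypothetical names, and postponed_start plays the role of server.aof_flush_postponed_start (0 meaning "not currently postponing").

```c
#include <assert.h>
#include <time.h>

enum flush_action { FLUSH_POSTPONE, FLUSH_WRITE };

/* Decide whether to write the AOF buffer now or postpone the write
 * because a background fsync is still running. */
enum flush_action decide_flush(int sync_in_progress, int force,
                               time_t now, time_t *postponed_start) {
    assert(postponed_start != NULL);
    if (!force && sync_in_progress) {
        if (*postponed_start == 0) {
            *postponed_start = now;      /* first postponement: remember when */
            return FLUSH_POSTPONE;
        }
        if (now - *postponed_start < 2)
            return FLUSH_POSTPONE;       /* still within the two-second budget */
        /* waited two seconds or more: fall through and write anyway */
    }
    *postponed_start = 0;                /* writing now, so reset the marker */
    return FLUSH_WRITE;
}
```

Calling it at times 100, 101 and 102 with a background fsync still running postpones twice and then writes on the third call, which is the two-second worst case the quote refers to.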


/* Write the append only file buffer on disk.
 * Since we are required to write the AOF before replying to the client,
 * and the only way the client socket can get a write is entering when the
 * the event loop, we accumulate all the AOF writes in a memory
 * buffer and write it on disk using this function just before entering
 * the event loop again.
 * About the 'force' argument:
 * When the fsync policy is set to 'everysec' we may delay the flush if there
 * is still an fsync() going on in the background thread, since for instance
 * on Linux write(2) will be blocked by the background fsync anyway.
 * When this happens we remember that there is some aof buffer to be
 * flushed ASAP, and will try to do that in the serverCron() function.
 * However if force is set to 1 we'll write regardless of the background
 * fsync. */
void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
    int sync_in_progress = 0;

    if (sdslen(server.aof_buf) == 0) return;
    if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
        sync_in_progress = bioPendingJobsOfType(REDIS_BIO_AOF_FSYNC) != 0;

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) { // not a forced flush, so the write may be postponed
        /* With this append fsync policy we do background fsyncing.
         * If the fsync is still in progress we can try to delay
         * the write for a couple of seconds. */
        if (sync_in_progress) {
            if (server.aof_flush_postponed_start == 0) { // not postponing yet: this is the first time through
                /* No previous write postponinig, remember that we are
                 * postponing the flush and return. */
                server.aof_flush_postponed_start = server.unixtime;
                return;
            } else if (server.unixtime - server.aof_flush_postponed_start < 2) {
                /* We were already waiting for fsync to finish, but for less
                 * than two seconds this is still ok. Postpone again. */
                return;
            }
            /* Otherwise fall trough, and go write since we can't wait
             * over two seconds. */
            server.aof_delayed_fsync++;
            redisLog(REDIS_NOTICE,"Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.");
        }
    }
    // Reaching this point means we either waited two seconds or more, or no background fsync of the AOF is running.
    /* If you are following this code path, then we are going to write so
     * set reset the postponed flush sentinel to zero. */
    server.aof_flush_postponed_start = 0;

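After the code above, all that remains is the write and the fsync. Because Redis puts the whole batch of commands directly into the server.aof_buf string, a single write() call is enough. If the number of bytes written differs from the total size, Redis calls exit(1) and terminates, which is rather drastic. After the write completes, server.aof_current_size is increased; this value is used to decide when to trigger an automatic AOF rewrite.

A brief digression on AOF rewrite: if every client command keeps being appended to the AOF file, the file grows without bound and may contain many redundant commands. The AOF file therefore has to be replaced periodically with a fresh snapshot, which effectively reduces its size. This is done in the serverCron timer task, as the following code shows: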

int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
    /* ... */
    /* Trigger an AOF rewrite if needed */
    if (server.rdb_child_pid == -1 &&
        server.aof_child_pid == -1 &&
        server.aof_rewrite_perc &&
        server.aof_current_size > server.aof_rewrite_min_size)
    {
        long long base = server.aof_rewrite_base_size ? server.aof_rewrite_base_size : 1;
        long long growth = (server.aof_current_size*100/base) - 100;
        // If the AOF file has grown past the configured percentage, rewrite it automatically.
        if (growth >= server.aof_rewrite_perc) {
            redisLog(REDIS_NOTICE,"Starting automatic rewriting of AOF on %lld%% growth",growth);
            rewriteAppendOnlyFileBackground();
        }
    }
    /* ... */
}
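The growth check in serverCron can be isolated into a small pure function, which makes the arithmetic easier to follow. The name should_rewrite is hypothetical, and the sketch leaves out the two child-process checks; the parameters correspond to server.aof_current_size, server.aof_rewrite_base_size, server.aof_rewrite_min_size and server.aof_rewrite_perc.

```c
#include <assert.h>

/* Returns 1 if the AOF has grown enough since the last rewrite to
 * justify a new automatic rewrite, 0 otherwise. */
int should_rewrite(long long current_size, long long base_size,
                   long long min_size, int rewrite_perc) {
    assert(current_size >= 0 && base_size >= 0);
    if (rewrite_perc == 0) return 0;            /* automatic rewrite disabled */
    if (current_size <= min_size) return 0;     /* file still too small to bother */
    long long base = base_size ? base_size : 1; /* avoid dividing by zero */
    long long growth = (current_size * 100 / base) - 100;
    return growth >= rewrite_perc;
}
```

With a base size of 100 bytes and a threshold of 100 percent, a file of 199 bytes gives a growth of 99 and does not trigger a rewrite, while 200 bytes gives exactly 100 and does.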



Back in flushAppendOnlyFile, the write itself looks like this:

    /* We want to perform a single write. This should be guaranteed atomic
     * at least if the filesystem we are writing is a real physical one.
     * While this will save us against the server being killed I don't think
     * there is much to do about the whole server stopping for power problems
     * or alike */
    nwritten = write(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
    if (nwritten != (signed)sdslen(server.aof_buf)) {
        /* ... error logging elided in this excerpt ... */
        exit(1);
    }
    // Track the AOF file size; it is used to decide when an automatic AOF rewrite is needed.
    server.aof_current_size += nwritten;

    /* Don't fsync if no-appendfsync-on-rewrite is set to yes and there are
     * children doing I/O in the background. */
    if (server.aof_no_fsync_on_rewrite &&
        (server.aof_child_pid != -1 || server.rdb_child_pid != -1))
            return;

    /* Perform the fsync if needed. */
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* aof_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
        aof_fsync(server.aof_fd); /* Let's try to get this data on the disk */
        server.aof_last_fsync = server.unixtime;
    } else if ((server.aof_fsync == AOF_FSYNC_EVERYSEC &&
                server.unixtime > server.aof_last_fsync)) {
        if (!sync_in_progress) aof_background_fsync(server.aof_fd);
        server.aof_last_fsync = server.unixtime;
    }
}

One question remains: AOF rewrite, that is, regenerating the AOF to shrink the file. For reasons of space it is covered in the next article.

