Notes on the principle and Java implementation of the SnowFlake algorithm

1. Basic overview:

SnowFlake is a distributed ID generation algorithm open-sourced by Twitter. Its core idea is to use a 64-bit long as a globally unique ID.

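It is widely used in distributed systems. Because each ID embeds a timestamp, IDs are roughly increasing over time. The code later in this post is commented in detail.

2. Explanation:

Of the 64 bits, 1 bit is unused, 41 bits hold a millisecond timestamp, 10 bits hold the worker machine ID, and 12 bits hold a sequence number. For example: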

0 0001100 10100011 10111110 10001001 00 10001 1 1001 0000 00000000

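Taking the 64-bit long above as an example (the illustration prints only 33 of the 41 timestamp bits):
Part 1 is 1 bit: 0. It carries no meaning.
Part 2 is 41 bits: the timestamp.
Part 3 is 5 bits: the data center ID, 10001.
Part 4 is 5 bits: the machine ID, 1 1001.
Part 5 is 12 bits: the sequence number, i.e. the index of this ID among those generated in the same millisecond on a given machine in a given data center, 0000 00000000.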

1 bit: unused. Why?

In Java's signed two's-complement long, a first bit of 1 means the number is negative. The IDs we generate are all positive, so the first bit is always 0.
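The effect of the first bit can be checked directly in Java. This is a small illustrative sketch; the class name SignBitDemo and the helper signBitSet are made up for the example.

```java
final class SignBitDemo {
    /** Returns true when the most significant (sign) bit of the value is set. */
    static boolean signBitSet(long value) {
        return (value >>> 63) == 1L;
    }

    public static void main(String[] args) {
        long topBitSet = 1L << 63;        // only the first (highest) bit is 1
        long largestId = Long.MAX_VALUE;  // first bit 0, remaining 63 bits all 1

        System.out.println(topBitSet < 0);          // true: Java reads it as negative
        System.out.println(signBitSet(largestId));  // false: still a positive number
    }
}
```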

41 bits: the timestamp, in milliseconds.

41 bits can represent values up to 2^41 - 1, i.e. 2^41 - 1 milliseconds, which is about 69 years.
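The 69-year figure can be reproduced with a few lines of arithmetic. The sketch below assumes 365-day years; TimestampRange and its members are hypothetical names used only for this example.

```java
final class TimestampRange {
    static final long TIMESTAMP_BITS = 41L;
    /** Largest millisecond offset that fits in 41 bits: 2^41 - 1. */
    static final long MAX_OFFSET_MILLIS = (1L << TIMESTAMP_BITS) - 1;

    /** Whole years covered by the 41-bit range, assuming 365-day years. */
    static long wholeYears() {
        long millisPerYear = 1000L * 60 * 60 * 24 * 365;
        return MAX_OFFSET_MILLIS / millisPerYear;
    }

    public static void main(String[] args) {
        System.out.println(MAX_OFFSET_MILLIS); // 2199023255551
        System.out.println(wholeYears());      // 69
    }
}
```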

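10 bits: the worker machine ID, which means the service can be deployed on at most 2^10 = 1024 machines.

Of these 10 bits, 5 bits are the data center ID and 5 bits are the machine ID. That allows at most 2^5 = 32 data centers with 2^5 = 32 machines each. The split can be adjusted to fit your own company's situation.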
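The limits for a field of a given width can be computed with the same `-1L ^ (-1L << bits)` expression that the implementation later in this post uses. A minimal sketch; the class MachineIdLimits and its methods are illustrative names.

```java
final class MachineIdLimits {
    /** Largest value that fits in the given number of bits (all of those bits set to 1). */
    static long maxValue(long bits) {
        return -1L ^ (-1L << bits);
    }

    /** True when the id fits into a field of the given width. */
    static boolean fits(long id, long bits) {
        return id >= 0 && id <= maxValue(bits);
    }

    public static void main(String[] args) {
        long maxId = maxValue(5);
        System.out.println(maxId);                       // 31
        System.out.println((maxId + 1) * (maxId + 1));   // 1024 machines in total
        System.out.println(fits(32, 5));                 // false: 32 needs a sixth bit
    }
}
```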

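12 bits: the sequence number, used to distinguish IDs generated within the same millisecond.

The largest integer 12 bits can represent is 2^12 - 1 = 4095, so the sequence number takes 4096 values (0 to 4095) and can distinguish 4096 different IDs within the same millisecond.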
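The wrap-around of the 12-bit sequence number can be seen in isolation. The sketch below mirrors the `(sequence + 1) & sequenceMask` step from the implementation in this post; SequenceWrap is a hypothetical class name.

```java
final class SequenceWrap {
    static final long SEQUENCE_BITS = 12L;
    /** Mask with the low 12 bits set: 2^12 - 1 = 4095. */
    static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;

    /** Next sequence number within the same millisecond; wraps to 0 after 4095. */
    static long next(long sequence) {
        return (sequence + 1) & SEQUENCE_MASK;
    }

    public static void main(String[] args) {
        System.out.println(SEQUENCE_MASK); // 4095
        System.out.println(next(0));       // 1
        System.out.println(next(4095));    // 0: all 4096 values of this millisecond are used up
    }
}
```

A result of 0 is the signal that the generator has to wait for the next millisecond before handing out another ID.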

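Put simply, when one of your services needs a globally unique ID, it sends a request to a system that runs the SnowFlake algorithm, and that system generates the ID. The SnowFlake system knows which data center and machine it runs on, for example data center ID = 17 and machine ID = 12. When a request arrives, it builds a 64-bit long ID with bitwise operations. The first bit is unused. The next 41 bits hold the current timestamp in milliseconds, followed by 5 bits for the data center ID and 5 bits for the machine ID. Finally, it checks how many requests this machine in this data center has already served within the current millisecond and assigns the request the next sequence number, which fills the last 12 bits. The result is a 64-bit ID similar to: 0 0001100 10100011 10111110 10001001 00 10001 1 1001 0000 00000000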
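The assembly step described above can be sketched on its own. The field widths and shift amounts follow the layout in this post; the class IdLayout, the method compose, and the sample timestamp offset of 1 ms are assumptions chosen to keep the output easy to read.

```java
final class IdLayout {
    static final long SEQUENCE_BITS = 12L;
    static final long WORKER_ID_BITS = 5L;
    static final long DATACENTER_ID_BITS = 5L;

    static final long WORKER_ID_SHIFT = SEQUENCE_BITS;                        // 12
    static final long DATACENTER_ID_SHIFT = SEQUENCE_BITS + WORKER_ID_BITS;   // 17
    static final long TIMESTAMP_SHIFT =
            SEQUENCE_BITS + WORKER_ID_BITS + DATACENTER_ID_BITS;              // 22

    /** Packs the four fields into one long; callers must pass values that fit their fields. */
    static long compose(long millisSinceEpoch, long datacenterId, long workerId, long sequence) {
        return (millisSinceEpoch << TIMESTAMP_SHIFT)
                | (datacenterId << DATACENTER_ID_SHIFT)
                | (workerId << WORKER_ID_SHIFT)
                | sequence;
    }

    public static void main(String[] args) {
        // Data center 17, machine 12, first ID of the millisecond, 1 ms after the epoch.
        long id = compose(1L, 17L, 12L, 0L);
        System.out.println(id);                        // 6471680
        System.out.println(Long.toBinaryString(id));   // 11000101100000000000000
    }
}
```

Reading the binary output from the left: 1 is the timestamp offset, 10001 is data center 17, 01100 is machine 12, and the twelve zeros are the sequence number.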

***********

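This algorithm guarantees that every ID generated on a given machine in a given data center is unique, even within the same millisecond: several IDs may be generated in one millisecond, but the 12-bit sequence number at the end tells them apart. Below is a simple implementation of the SnowFlake algorithm. It is only an example; once you understand the idea, you can adapt the algorithm yourself. In short, the different bit ranges of a 64-bit number are used as separate fields that together distinguish every ID.

The Java implementation of the SnowFlake algorithm is as follows: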

public class IdWorker{

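    // The next two fields are 5 bits each; together they form the 10-bit worker machine ID
    private long workerId;       // worker (machine) ID
    private long datacenterId;   // data center ID
    // 12-bit sequence number
    private long sequence;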

    public IdWorker(long workerId, long datacenterId, long sequence){
        // sanity check for workerId
        if (workerId > maxWorkerId || workerId < 0) {
            throw new IllegalArgumentException(String.format("worker Id can't be greater than %d or less than 0",maxWorkerId));
        }
        if (datacenterId > maxDatacenterId || datacenterId < 0) {
            throw new IllegalArgumentException(String.format("datacenter Id can't be greater than %d or less than 0",maxDatacenterId));
        }
        System.out.printf("worker starting. timestamp left shift %d, datacenter id bits %d, worker id bits %d, sequence bits %d, workerid %d%n",
                timestampLeftShift, datacenterIdBits, workerIdBits, sequenceBits, workerId);

        this.workerId = workerId;
        this.datacenterId = datacenterId;
        this.sequence = sequence;
    }

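    // Custom epoch (start timestamp) in milliseconds: 2010-11-04 01:42:54.657 UTC
    private long twepoch = 1288834974657L;

    // Worker ID and data center ID are 5 bits each
    private long workerIdBits = 5L;
    private long datacenterIdBits = 5L;
    // Maximum values of the two IDs: 2^5 - 1 = 31
    private long maxWorkerId = -1L ^ (-1L << workerIdBits);
    private long maxDatacenterId = -1L ^ (-1L << datacenterIdBits);
    // Number of bits in the sequence number
    private long sequenceBits = 12L;
    // Mask (and maximum value) of the sequence number: 2^12 - 1 = 4095
    private long sequenceMask = -1L ^ (-1L << sequenceBits);

    // Worker ID is shifted left by 12 bits
    private long workerIdShift = sequenceBits;
    // Data center ID is shifted left by 12 + 5 = 17 bits
    private long datacenterIdShift = sequenceBits + workerIdBits;
    // Timestamp is shifted left by 12 + 5 + 5 = 22 bits
    private long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits;

    // Timestamp of the last generated ID; starts out negative
    private long lastTimestamp = -1L;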

    public long getWorkerId(){
        return workerId;
    }

    public long getDatacenterId(){
        return datacenterId;
    }

    public long getTimestamp(){
        return System.currentTimeMillis();
    }

    // Generates the next ID
    public synchronized long nextId() {
        long timestamp = timeGen();

        // If the current timestamp is earlier than the last one, the system clock has moved backwards
        if (timestamp < lastTimestamp) {
            System.err.printf("clock is moving backwards.  Rejecting requests until %d.%n", lastTimestamp);
            throw new RuntimeException(String.format("Clock moved backwards.  Refusing to generate id for %d milliseconds",
                    lastTimestamp - timestamp));
        }

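        // Same millisecond as the last ID: increment the sequence, and if it wraps to 0, wait for the next millisecond.
        // Different millisecond: restart the sequence at 0.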
        if (lastTimestamp == timestamp) {
            sequence = (sequence + 1) & sequenceMask;
            if (sequence == 0) {
                timestamp = tilNextMillis(lastTimestamp);
            }
        } else {
            sequence = 0;
        }
        
        // Remember the timestamp used for this ID
        lastTimestamp = timestamp;

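        /**
          * Return value:
          * ((timestamp - twepoch) << timestampLeftShift) subtracts the custom epoch from the timestamp and shifts the result left into the timestamp field
          * (datacenterId << datacenterIdShift) shifts the data center ID left into its field
          * (workerId << workerIdShift) shifts the worker ID left into its field
          * | is the bitwise OR operator: each bit of x | y is 0 only when that bit is 0 in both x and y, and 1 otherwise.
          * Each part has meaningful bits only in its own field and 0 everywhere else, so OR-ing the parts together yields the final assembled id
        */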
        return ((timestamp - twepoch) << timestampLeftShift) |
                (datacenterId << datacenterIdShift) |
                (workerId << workerIdShift) |
                sequence;
    }

    // Spins until the system clock has advanced past lastTimestamp
    private long tilNextMillis(long lastTimestamp) {
        long timestamp = timeGen();
        while (timestamp <= lastTimestamp) {
            timestamp = timeGen();
        }
        return timestamp;
    }

    // Current system time in milliseconds
    private long timeGen(){
        return System.currentTimeMillis();
    }

    //---------------Test---------------
    public static void main(String[] args) {
        IdWorker worker = new IdWorker(1,1,1);
        for (int i = 0; i < 30; i++) {
            System.out.println(worker.nextId());
        }
    }

}
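To check what an ID produced by IdWorker contains, the fields can be pulled back out with shifts and masks. The sketch below assumes the same layout and twepoch as the class above; IdDecoder and its method names are hypothetical helpers, not part of the original code.

```java
final class IdDecoder {
    static final long TWEPOCH = 1288834974657L;
    static final long SEQUENCE_BITS = 12L;
    static final long WORKER_ID_BITS = 5L;
    static final long DATACENTER_ID_BITS = 5L;

    static final long WORKER_ID_SHIFT = SEQUENCE_BITS;
    static final long DATACENTER_ID_SHIFT = SEQUENCE_BITS + WORKER_ID_BITS;
    static final long TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_ID_BITS + DATACENTER_ID_BITS;

    static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;
    static final long WORKER_ID_MASK = (1L << WORKER_ID_BITS) - 1;
    static final long DATACENTER_ID_MASK = (1L << DATACENTER_ID_BITS) - 1;

    /** Wall-clock time in milliseconds at which the id was generated. */
    static long timestampMillis(long id) {
        return (id >>> TIMESTAMP_SHIFT) + TWEPOCH;
    }

    static long datacenterId(long id) {
        return (id >>> DATACENTER_ID_SHIFT) & DATACENTER_ID_MASK;
    }

    static long workerId(long id) {
        return (id >>> WORKER_ID_SHIFT) & WORKER_ID_MASK;
    }

    static long sequence(long id) {
        return id & SEQUENCE_MASK;
    }

    public static void main(String[] args) {
        // Build an id by hand with the same formula nextId() uses, then take it apart again.
        long generatedAt = 1700000000000L;
        long id = ((generatedAt - TWEPOCH) << TIMESTAMP_SHIFT)
                | (3L << DATACENTER_ID_SHIFT)
                | (7L << WORKER_ID_SHIFT)
                | 42L;
        System.out.println(timestampMillis(id)); // 1700000000000
        System.out.println(datacenterId(id));    // 3
        System.out.println(workerId(id));        // 7
        System.out.println(sequence(id));        // 42
    }
}
```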

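Advantages of the SnowFlake algorithm:
(1) High performance and high availability: IDs are generated entirely in memory, with no dependency on a database.
(2) Large capacity: a single node can generate up to 4096 IDs per millisecond, i.e. millions of increasing IDs per second.
(3) Increasing IDs: when stored in a database, they index efficiently.

Disadvantages of the SnowFlake algorithm: it depends on a consistent system clock. If the system time is set back or otherwise changed, IDs may conflict or repeat.

In practice we usually do not have that many data centers, so we can adapt the algorithm and reuse the 10-bit machine ID field, for example to encode a business table or another identifier related to our own system.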
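One common way to soften the clock dependency is to tolerate small backward jumps by waiting for the clock to catch up, and to fail only on larger ones. The following is a minimal sketch of that idea, not part of the original IdWorker: the class TolerantIdWorker, the maxBackwardMillis parameter and the injectable LongSupplier clock are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;

final class TolerantIdWorker {
    private static final long TWEPOCH = 1288834974657L;
    private static final long SEQUENCE_MASK = (1L << 12) - 1;
    private static final long MAX_FIELD_ID = (1L << 5) - 1;

    private final long workerId;
    private final long datacenterId;
    private final long maxBackwardMillis;   // hypothetical tolerance for clock regressions
    private final LongSupplier clock;       // injectable so the behaviour can be tested

    private long sequence = 0L;
    private long lastTimestamp = -1L;

    TolerantIdWorker(long workerId, long datacenterId, long maxBackwardMillis, LongSupplier clock) {
        if (workerId < 0 || workerId > MAX_FIELD_ID || datacenterId < 0 || datacenterId > MAX_FIELD_ID) {
            throw new IllegalArgumentException("worker and datacenter ids must be in 0.." + MAX_FIELD_ID);
        }
        this.workerId = workerId;
        this.datacenterId = datacenterId;
        this.maxBackwardMillis = maxBackwardMillis;
        this.clock = clock;
    }

    synchronized long nextId() {
        long timestamp = clock.getAsLong();
        if (timestamp < lastTimestamp) {
            long drift = lastTimestamp - timestamp;
            if (drift > maxBackwardMillis) {
                throw new IllegalStateException("Clock moved backwards by " + drift + " ms");
            }
            // Small regression: poll until the clock is back at the last timestamp used.
            timestamp = waitUntil(lastTimestamp);
        }
        if (timestamp == lastTimestamp) {
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {
                // Sequence exhausted for this millisecond: wait for the next one.
                timestamp = waitUntil(lastTimestamp + 1);
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = timestamp;
        return ((timestamp - TWEPOCH) << 22) | (datacenterId << 17) | (workerId << 12) | sequence;
    }

    /** Busy-waits until the clock reports at least the target time and returns that reading. */
    private long waitUntil(long target) {
        long now = clock.getAsLong();
        while (now < target) {
            now = clock.getAsLong();
        }
        return now;
    }

    public static void main(String[] args) {
        TolerantIdWorker worker = new TolerantIdWorker(1, 1, 5, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            System.out.println(worker.nextId());
        }
    }
}
```

Waiting keeps IDs unique and increasing at the cost of a short stall. A regression larger than the tolerance still raises an exception, because blocking callers for a long time is usually worse than failing fast.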
