Hadoop VLongWritable/VIntWritable 变长编码存储源码解析

Hadoop里面有VLongWritable和VIntWritable，对long和int进行编码存储，它们都是采用zero-compressed encoded的变长编码方式存储数值，有利于减少数据空间占用。算法大概过程是单独拿出一个字节存储数值占用的字节长度，再拿出一些字节存储数值。这样的话一个long类型就不是固定的占用8个字节了，而是可以根据数值范围压缩成1 ～ 9 个字节。大大减少了，空间占用。对一个数值 i 编码具体过程如下：

对于-112 <= i <= 127的整数，只用1个字节byte来表示；如果超过上述范围时，编码第一个字节则会用来表示i的总字节数，后面则跟着 i 数值的字节；
如果i大于0，则编码的第一个字节 b 范围在-113和-120之间，则 i 会有 -(112+b)个字节，所以可以表示有1-8个字节；
如果i小于0，则编码第一个字节 b 范围在 -121 和 -128之间，则 i 会有 -(120+b)个字节，同样也可以表示有1-8个字节。对于负数hadoop实现的时候先进行了补码操作，然后再进行编码。

算法不复杂，比较容易理解但是实现的时候还有一些小技巧，hadoop WritableUtils 实现如下：

  /**
   * Serializes a long to a binary stream with zero-compressed encoding.
   * For -112 <= i <= 127, only one byte is used with the actual value.
   * For other values of i, the first byte value indicates whether the
   * long is positive or negative, and the number of bytes that follow.
   * If the first byte value v is between -113 and -120, the following long
   * is positive, with number of bytes that follow are -(v+112).
   * If the first byte value v is between -121 and -128, the following long
   * is negative, with number of bytes that follow are -(v+120). Bytes are
   * stored in the high-non-zero-byte-first order.
   * 
   * @param stream Binary output stream
   * @param i Long to be serialized
   * @throws java.io.IOException 
   */
  public static void writeVLong(DataOutput stream, long i) throws IOException {
    if (i >= -112 && i <= 127) {
      stream.writeByte((byte)i);
      return;
    }

    int len = -112;
    if (i < 0) {
      i ^= -1L; // take one's complement' //负数求补码
      len = -120;
    }

    long tmp = i;
    while (tmp != 0) {
      tmp = tmp >> 8;
      len--;
    }

    stream.writeByte((byte)len);

    len = (len < -120) ? -(len + 120) : -(len + 112);

    for (int idx = len; idx != 0; idx--) {
      int shiftbits = (idx - 1) * 8;
      long mask = 0xFFL << shiftbits;
      stream.writeByte((byte)((i & mask) >> shiftbits));
    }
  }

  /**
   * Reads a zero-compressed encoded long from input stream and returns it.
   * @param stream Binary input stream
   * @throws java.io.IOException 
   * @return deserialized long from stream.
   */
 public static long readVLong(DataInput stream) throws IOException {
    byte firstByte = stream.readByte();
    int len = decodeVIntSize(firstByte);
    if (len == 1) {
      return firstByte;
    }
    long i = 0;
    for (int idx = 0; idx < len-1; idx++) {
      byte b = stream.readByte();
      i = i << 8;
      i = i | (b & 0xFF);
    }
    return (isNegativeVInt(firstByte) ? (i ^ -1L) : i);
  }

  /**
   * Parse the first byte of a vint/vlong to determine the number of bytes
   * @param value the first byte of the vint/vlong
   * @return the total number of bytes (1 to 9)
   */
  public static int decodeVIntSize(byte value) {
    if (value >= -112) {
      return 1;
    } else if (value < -120) {
      return -119 - value;
    }
    return -111 - value;
  }

我们也可以字节写代码验证一下：

package com.hainiubl.hadoop.demo;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

public class VLongDemo {

    //将writable转换成字节
    public static byte[] writable2Bytes(Writable writable) throws IOException {

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();

        return out.toByteArray();

    }
    public static void main(String[] args) throws IOException {
        List<Writable> list = new ArrayList<Writable>();
        list.add(new LongWritable(100L));
        list.add(new VLongWritable(100L));
        list.add(new VLongWritable(9999L));
        list.add(new VLongWritable(9999999999L));

        byte[] bytes;
        for(Writable w : list){
            bytes = writable2Bytes(w); // 序列化成字节
            System.out.println(w.getClass().getSimpleName() + ":" + w + ",hex:" + StringUtils.byteToHexString(bytes) + ",len:" + bytes.length);
        }

    }
}

执行结果：

LongWritable:100,hex:0000000000000064,len:8
VLongWritable:100,hex:64,len:1
VLongWritable:9999,hex:8e270f,len:3
VLongWritable:9999999999,hex:8b02540be3ff,len:6

可以看出VLongWritable 压缩掉前面填充的 0 ，节省了空间。

作者：青牛

青牛的其他话题

分类下其他主题

随机推荐话题

Hadoop VLongWritable/VIntWritable 变长编码存储源码解析

添加附言

作者：青牛

青牛 的其他话题

分类下其他主题

随机推荐话题

青牛的其他话题