Hadoop VLongWritable/VIntWritable 变长编码存储源码解析

分享 青牛 ⋅ 于 2017-01-02 12:37:04 ⋅ 5245 阅读

Hadoop里面有VLongWritable和VIntWritable,对long和int进行编码存储,它们都是采用zero-compressed encoded的变长编码方式存储数值,有利于减少数据空间占用。算法大概过程是单独拿出一个字节存储数值占用的字节长度,再拿出一些字节存储数值。这样的话一个long类型就不是固定的占用8个字节了,而是可以根据数值范围压缩成1 ~ 9 个字节。大大减少了,空间占用。对一个数值 i 编码具体过程如下:

  • 对于-112 <= i <= 127的整数,只用1个字节byte来表示;如果超过上述范围时,编码第一个字节则会用来表示i的总字节数,后面则跟着 i 数值的字节;
  • 如果i大于0,则编码的第一个字节 b 范围在-113和-120之间,则 i 会有 -(112+b)个字节,所以可以表示有1-8个字节;
  • 如果i小于0,则编码第一个字节 b 范围在 -121 和 -128之间,则 i 会有 -(120+b)个字节,同样也可以表示有1-8个字节。对于负数hadoop实现的时候先进行了补码操作,然后再进行编码。

算法不复杂,比较容易理解但是实现的时候还有一些小技巧,hadoop WritableUtils 实现如下:

  /**
   * Serializes a long to a binary stream with zero-compressed encoding.
   * For -112 <= i <= 127, only one byte is used with the actual value.
   * For other values of i, the first byte value indicates whether the
   * long is positive or negative, and the number of bytes that follow.
   * If the first byte value v is between -113 and -120, the following long
   * is positive, with number of bytes that follow are -(v+112).
   * If the first byte value v is between -121 and -128, the following long
   * is negative, with number of bytes that follow are -(v+120). Bytes are
   * stored in the high-non-zero-byte-first order.
   * 
   * @param stream Binary output stream
   * @param i Long to be serialized
   * @throws java.io.IOException 
   */
  public static void writeVLong(DataOutput stream, long i) throws IOException {
    if (i >= -112 && i <= 127) {
      stream.writeByte((byte)i);
      return;
    }

    int len = -112;
    if (i < 0) {
      i ^= -1L; // take one's complement' //负数求补码
      len = -120;
    }

    long tmp = i;
    while (tmp != 0) {
      tmp = tmp >> 8;
      len--;
    }

    stream.writeByte((byte)len);

    len = (len < -120) ? -(len + 120) : -(len + 112);

    for (int idx = len; idx != 0; idx--) {
      int shiftbits = (idx - 1) * 8;
      long mask = 0xFFL << shiftbits;
      stream.writeByte((byte)((i & mask) >> shiftbits));
    }
  }

  /**
   * Reads a zero-compressed encoded long from input stream and returns it.
   * @param stream Binary input stream
   * @throws java.io.IOException 
   * @return deserialized long from stream.
   */
 public static long readVLong(DataInput stream) throws IOException {
    byte firstByte = stream.readByte();
    int len = decodeVIntSize(firstByte);
    if (len == 1) {
      return firstByte;
    }
    long i = 0;
    for (int idx = 0; idx < len-1; idx++) {
      byte b = stream.readByte();
      i = i << 8;
      i = i | (b & 0xFF);
    }
    return (isNegativeVInt(firstByte) ? (i ^ -1L) : i);
  }

  /**
   * Parse the first byte of a vint/vlong to determine the number of bytes
   * @param value the first byte of the vint/vlong
   * @return the total number of bytes (1 to 9)
   */
  public static int decodeVIntSize(byte value) {
    if (value >= -112) {
      return 1;
    } else if (value < -120) {
      return -119 - value;
    }
    return -111 - value;
  }

我们也可以字节写代码验证一下:

package com.hainiubl.hadoop.demo;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

public class VLongDemo {

    //将writable转换成字节
    public static byte[] writable2Bytes(Writable writable) throws IOException {

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();

        return out.toByteArray();

    }
    public static void main(String[] args) throws IOException {
        List<Writable> list = new ArrayList<Writable>();
        list.add(new LongWritable(100L));
        list.add(new VLongWritable(100L));
        list.add(new VLongWritable(9999L));
        list.add(new VLongWritable(9999999999L));

        byte[] bytes;
        for(Writable w : list){
            bytes = writable2Bytes(w); // 序列化成字节
            System.out.println(w.getClass().getSimpleName() + ":" + w + ",hex:" + StringUtils.byteToHexString(bytes) + ",len:" + bytes.length);
        }

    }
}

执行结果:

LongWritable:100,hex:0000000000000064,len:8
VLongWritable:100,hex:64,len:1
VLongWritable:9999,hex:8e270f,len:3
VLongWritable:9999999999,hex:8b02540be3ff,len:6

可以看出VLongWritable 压缩掉前面填充的 0 ,节省了空间。

版权声明:原创作品,允许转载,转载时务必以超链接的形式表明出处和作者信息。否则将追究法律责任。来自海汼部落-青牛,http://hainiubl.com/topics/27
本帖由 青牛 于 6年前 解除加精
点赞
成为第一个点赞的人吧 :bowtie:
回复数量: 0
    暂无评论~~
    • 请注意单词拼写,以及中英文排版,参考此页
    • 支持 Markdown 格式, **粗体**、~~删除线~~、`单行代码`, 更多语法请见这里 Markdown 语法
    • 支持表情,可用Emoji的自动补全, 在输入的时候只需要 ":" 就可以自动提示了 :metal: :point_right: 表情列表 :star: :sparkles:
    • 上传图片, 支持拖拽和剪切板黏贴上传, 格式限制 - jpg, png, gif,教程
    • 发布框支持本地存储功能,会在内容变更时保存,「提交」按钮点击时清空
    Ctrl+Enter