Page and Record - mir the developer

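Last night I crashed at 8:30 pm, and this morning I was up at 4:30 lol. With some light exercise and a shower out of the way, I wondered what to do next, so I'm continuing my source dive into InnoDB's Page and Record. This month is InnoDB intensive month lol. Last month I read almost nothing but JBoss source, partly to prepare material for a study group, but this month it's InnoDB.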

First, let's take stock of where things stand.

Source files to read for now:

innobase/include/univ.i
innobase/include/rem0rec.h
innobase/include/rem0rec.ic
innobase/rem/rem0rec.c

The "rem" here is short for Record Manager. And "rec" being short for record needs no explanation, right?

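InnoDB's record structure has been touched on here and there: in slides from past conferences, in a talk a certain Mr. M gave at a certain study group (two years ago?), and in the MySQL Internals Manual (http://dev.mysql.com/doc/internals/en/innodb-record-structure.html). Still, I think it's time to sort it all out once.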

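The reason is that at some point (I don't know exactly when) InnoDB's record structure was upgraded, and the earlier format and the newer one are now managed internally as two separate things, "old style" and "new style".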

In fact, the explanation in the MySQL Internals Manual is based only on the "old style", so learning about the "new style" seems to require a fresh look at the source.

That said, it's no big deal: there is a hefty comment at the top of "rem0rec.c", and all you have to do is read it.

So, first up is the "old style". From rem0rec.c:

/*			PHYSICAL RECORD (OLD STYLE)
			===========================

The physical record, which is the data type of all the records
found in index pages of the database, has the following format
(lower addresses and more significant bits inside a byte are below
represented on a higher text line):

| offset of the end of the last field of data, the most significant
  bit is set to 1 if and only if the field is SQL-null,
  if the offset is 2-byte, then the second most significant
  bit is set to 1 if the field is stored on another page:
  mostly this will occur in the case of big BLOB fields |
... 
| offset of the end of the first field of data + the SQL-null bit |
| 4 bits used to delete mark a record, and mark a predefined
  minimum record in alphabetical order |
| 4 bits giving the number of records owned by this record
  (this term is explained in page0page.h) |
| 13 bits giving the order number of this record in the
  heap of the index page |
| 10 bits giving the number of fields in this record |
| 1 bit which is set to 1 if the offsets above are given in
  one byte format, 0 if in two byte format |
| two bytes giving an absolute pointer to the next record in the page |
ORIGIN of the record
| first field of data | 
... 
| last field of data |

The origin of the record is the start address of the first field 
of data. The offsets are given relative to the origin. 
The offsets of the data fields are stored in an inverted
order because then the offset of the first fields are near the 
origin, giving maybe a better processor cache hit rate in searches.

The offsets of the data fields are given as one-byte 
(if there are less than 127 bytes of data in the record) 
or two-byte unsigned integers. The most significant bit
is not part of the offset, instead it indicates the SQL-null
if the bit is set to 1. */

It's detailed but hard to follow, so here is a more compact version from rem0rec.ic.

/* Offsets of the bit-fields in an old-style record. NOTE! In the table the
most significant bytes and bits are written below less significant.

    (1) byte offset     (2) bit usage within byte
    downward from
    origin ->   1   8 bits pointer to next record
            2   8 bits pointer to next record
            3   1 bit short flag
                7 bits number of fields
            4   3 bits number of fields
                5 bits heap number
            5   8 bits heap number
            6   4 bits n_owned
                4 bits info bits
*/

And now the description of the "new style". From rem0rec.c:

/*			PHYSICAL RECORD (NEW STYLE)
			===========================

The physical record, which is the data type of all the records
found in index pages of the database, has the following format
(lower addresses and more significant bits inside a byte are below
represented on a higher text line):

| length of the last non-null variable-length field of data:
  if the maximum length is 255, one byte; otherwise,
  0xxxxxxx (one byte, length=0..127), or 1exxxxxxxxxxxxxx (two bytes,
  length=128..16383, extern storage flag) |
...
| length of first variable-length field of data |
| SQL-null flags (1 bit per nullable field), padded to full bytes |
| 4 bits used to delete mark a record, and mark a predefined
  minimum record in alphabetical order |
| 4 bits giving the number of records owned by this record
  (this term is explained in page0page.h) |
| 13 bits giving the order number of this record in the
  heap of the index page |
| 3 bits record type: 000=conventional, 001=node pointer (inside B-tree),
  010=infimum, 011=supremum, 1xx=reserved |
| two bytes giving a relative pointer to the next record in the page |
ORIGIN of the record
| first field of data |
...
| last field of data |

The origin of the record is the start address of the first field
of data. The offsets are given relative to the origin.
The offsets of the data fields are stored in an inverted
order because then the offset of the first fields are near the
origin, giving maybe a better processor cache hit rate in searches.

The offsets of the data fields are given as one-byte
(if there are less than 127 bytes of data in the record)
or two-byte unsigned integers. The most significant bit
is not part of the offset, instead it indicates the SQL-null
if the bit is set to 1. */
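The length encoding described at the top of that comment (0xxxxxxx for one byte, 1exxxxxxxxxxxxxx for two bytes) can be sketched in a few lines. This is only my reading of the comment, not code taken from InnoDB: the function name `decode_length` is made up, it covers only the "otherwise" case (columns whose maximum length is over 255), and it assumes `first` is the byte that carries the two flag bits.

```python
from typing import Optional, Tuple


def decode_length(first: int, second: Optional[int] = None) -> Tuple[int, bool, int]:
    """Return (length, stored_externally, bytes_used) for one length entry.

    Hypothetical helper following the comment's bit patterns:
      0xxxxxxx          -> one byte, length 0..127
      1exxxxxxxxxxxxxx  -> two bytes, e = extern storage flag, 14-bit length
    """
    if first & 0x80 == 0:
        # High bit clear: the whole entry is this single byte.
        return first, False, 1
    if second is None:
        raise ValueError("a two-byte length entry needs a second byte")
    value = (first << 8) | second
    return value & 0x3FFF, bool(value & 0x4000), 2


print(decode_length(0x05))        # (5, False, 1)
print(decode_length(0x81, 0x00))  # (256, False, 2)
print(decode_length(0xC0, 0x80))  # (128, True, 2)
```

The 14 length bits are where the upper bound of 16383 in the comment comes from.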

And its compact version. From rem0rec.ic:

/* Offsets of the bit-fields in a new-style record. NOTE! In the table the
most significant bytes and bits are written below less significant.

    (1) byte offset     (2) bit usage within byte
    downward from
    origin ->   1   8 bits relative offset of next record
            2   8 bits relative offset of next record
                  the relative offset is an unsigned 16-bit
                  integer:
                  (offset_of_next_record
                   - offset_of_this_record) mod 64Ki,
                  where mod is the modulo as a non-negative
                  number;
                  we can calculate the the offset of the next
                  record with the formula:
                  relative_offset + offset_of_this_record
                  mod UNIV_PAGE_SIZE
            3   3 bits status:
                    000=conventional record
                    001=node pointer record (inside B-tree)
                    010=infimum record
                    011=supremum record
                    1xx=reserved
                5 bits heap number
            4   8 bits heap number
            5   4 bits n_owned
                4 bits info bits
*/
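The relative-offset arithmetic in that comment is easy to get wrong in your head, so here is a small sketch of the two formulas. The function names are mine, and `UNIV_PAGE_SIZE` is assumed to be the default 16 KiB; neither is taken from the InnoDB source.

```python
UNIV_PAGE_SIZE = 16 * 1024  # assumption: default 16 KiB page


def relative_offset(this_rec: int, next_rec: int) -> int:
    """(offset_of_next_record - offset_of_this_record) mod 64Ki, non-negative."""
    return (next_rec - this_rec) % 65536


def next_record(this_rec: int, rel: int) -> int:
    """relative_offset + offset_of_this_record, mod UNIV_PAGE_SIZE."""
    return (this_rec + rel) % UNIV_PAGE_SIZE


# Next record lies *before* this one in the page: the difference wraps around.
rel = relative_offset(200, 120)
print(rel)                    # 65456
print(next_record(200, rel))  # 120

# Next record lies after this one: plain difference.
print(relative_offset(120, 200))  # 80
print(next_record(120, 80))       # 200
```

Python's `%` already returns a non-negative result for a positive modulus, which matches the "modulo as a non-negative number" wording in the comment.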

Having come this far, the obvious question is when "old style" and "new style" are each used. After all, if "old style" is no longer used at all, there is no point in analyzing it.

The first thing that comes to mind is whether specifying redundant for "row_format" gives you the old style. If you don't know what row_format is, see something like http://dev.mysql.com/doc/refman/5.0/en/create-table.html.

ROW_FORMAT [=] {DEFAULT|DYNAMIC|FIXED|COMPRESSED|REDUNDANT|COMPACT}

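The proper way would be to work out the control flow and pin down exactly what decides which format is used, but the control flow is still too hard for me right now, so I'll start with what I can do lol.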

For a start, I create a table with the table option 'row_format=redundant'.

[test] > create table t_redundant (c1 int primary key, c2 char(20) charset latin1) 
       > engine=innodb row_format=redundant; 
 Query OK, 0 rows affected (0.01 sec)

[test] > insert into t_redundant values (1, 'aaa');
Query OK, 1 row affected (0.01 sec)

[test] > insert into t_redundant values (2, 'bbb');
Query OK, 1 row affected (0.00 sec)

[test] > insert into t_redundant values (100, 'mnb');
Query OK, 1 row affected (0.01 sec)

[test] > insert into t_redundant values (30, 'aiueokakikukeko');
Query OK, 1 row affected (0.00 sec)

[test] > insert into t_redundant values (-10, "zzz");
Query OK, 1 row affected (0.02 sec)

With a few records inserted, I ran "hexdump -C" on ibdata1. The output is uploaded here:

http://ikda.net/resource/mysql/ibdata1_hd

Searching with the inserted string "aiueo" as a clue turns up the page in question.

000c7ff0  00 00 00 00 00 74 00 65  10 6d c0 e7 00 00 aa 81  |.....t.e.m...|
000c8000  17 fc 76 6c 00 00 00 32  ff ff ff ff ff ff ff ff  |.vl...2|
000c8010  00 00 00 00 00 00 f3 90  45 bf 00 00 00 00 00 00  |.......E......|
000c8020  00 00 00 00 00 00 00 02  01 68 00 07 00 00 00 00  |.........h......|
000c8030  01 43 00 05 00 00 00 05  00 00 00 00 00 00 00 00  |.C..............|
000c8040  00 00 00 00 00 00 00 00  00 15 00 00 00 00 00 00  |................|
000c8050  00 02 12 f2 00 00 00 00  00 00 00 02 12 32 08 01  |............2..|
000c8060  00 00 03 01 43 69 6e 66  69 6d 75 6d 00 09 06 00  |....Cinfimum....|
000c8070  08 03 00 00 73 75 70 72  65 6d 75 6d 00 25 11 0a  |....supremum.%..|
000c8080  04 00 00 10 09 00 b6 80  00 00 01 00 00 00 00 0d  |...............|
000c8090  04 80 00 00 00 2d 01 10  61 61 61 20 20 20 20 20  |.....-..aaa     |
000c80a0  20 20 20 20 20 20 20 20  20 20 20 20 25 11 0a 04  |            %...|
000c80b0  00 00 18 09 01 14 80 00  00 02 00 00 00 00 0d 05  |................|
000c80c0  80 00 00 00 2d 01 10 62  62 62 20 20 20 20 20 20  |....-..bbb      |
000c80d0  20 20 20 20 20 20 20 20  20 20 20 25 11 0a 04 00  |           %....|
000c80e0  00 20 09 00 74 80 00 00  64 00 00 00 00 0d 06 80  |. ..t...d.......|
000c80f0  00 00 00 2d 01 10 6d 6e  62 20 20 20 20 20 20 20  |...-..mnb       |
000c8100  20 20 20 20 20 20 20 20  20 20 25 11 0a 04 00 00  |          %.....|
000c8110  28 09 00 e5 80 00 00 1e  00 00 00 00 0d 07 80 00  |(..............|
000c8120  00 00 2d 01 10 61 69 75  65 6f 6b 61 6b 69 6b 75  |..-..aiueokakiku|
000c8130  6b 65 6b 6f 20 20 20 20  20 25 11 0a 04 00 00 30  |keko     %.....0|
000c8140  09 00 87 7f ff ff f6 00  00 00 00 0f 00 80 00 00  |.............|
000c8150  00 2d 01 10 7a 7a 7a 20  20 20 20 20 20 20 20 20  |.-..zzz         |
000c8160  20 20 20 20 20 20 20 20  00 00 00 00 00 00 00 00  |        ........|
000c8170  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
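As a rough sketch of that search step: find the byte string in the raw file, then split the file position into a page number and an offset inside the page. The helper name `locate` is made up, the page size is assumed to be the default 16 KiB, and the data below is a synthetic stand-in for ibdata1 (zero bytes followed by the string), not the real file.

```python
PAGE_SIZE = 16 * 1024  # assumption: default 16 KiB InnoDB page


def locate(data: bytes, needle: bytes, page_size: int = PAGE_SIZE):
    """Return (file_offset, page_no, offset_in_page) of the first match, or None."""
    pos = data.find(needle)
    if pos < 0:
        return None
    page_no, in_page = divmod(pos, page_size)
    return pos, page_no, in_page


# Synthetic stand-in: put the string where the dump above shows it (0xc8125).
data = bytes(0xC8125) + b"aiueokakikukeko"
pos, page_no, in_page = locate(data, b"aiueo")
print(hex(pos), page_no, hex(in_page))  # 0xc8125 50 0x125
```

Page 50 is 0x32, which agrees with the bytes 00 00 00 32 at 000c8004 near the start of the page. That those four bytes are the page number is my own reading of the dump, so take it as a cross-check rather than a definition.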

Let's look more closely, starting with the row inserted as (1, "aaa"). The part in blue (61 61 61 followed by seventeen 20s) is the CHAR(20) column.

000c8070  08 03 00 00 73 75 70 72  65 6d 75 6d 00 25 11 0a  |....supremum.%..|
000c8080  04 00 00 10 09 00 b6 80  00 00 01 00 00 00 00 0d  |...............|
000c8090  04 80 00 00 00 2d 01 10  61 61 61 20 20 20 20 20  |.....-..aaa     |
000c80a0  20 20 20 20 20 20 20 20  20 20 20 20 25 11 0a 04  |            %...|

I didn't pass --with-extra-charsets to ./configure, so cp932 and friends weren't available lol. Since this is latin1, each CHAR character takes 1 byte. And with "row_format=redundant", CHAR(20) takes up exactly 20 bytes, padded with spaces (0x20).
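To make the padding concrete, here is a tiny sketch that rebuilds the 20 stored bytes for 'aaa'. The helper name `char_bytes` is hypothetical and it only mimics what the dump shows for a latin1 CHAR(20); it is not how InnoDB itself builds the field.

```python
def char_bytes(value: str, width: int) -> bytes:
    """Encode as latin1 and pad with spaces (0x20) to the declared CHAR width."""
    raw = value.encode("latin-1")
    if len(raw) > width:
        raise ValueError("value is longer than the column width")
    return raw.ljust(width, b" ")


stored = char_bytes("aaa", 20)
print(len(stored))      # 20
print(stored.hex(" "))  # 61 61 61 followed by seventeen 20s
```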

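Next, the part in green is the record stored for (1, "aaa"): it starts at 25 11 0a 04 on the supremum line and runs through the last 20 of the padded c2.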

000c8070  08 03 00 00 73 75 70 72  65 6d 75 6d 00 25 11 0a  |....supremum.%..|
000c8080  04 00 00 10 09 00 b6 80  00 00 01 00 00 00 00 0d  |...............|
000c8090  04 80 00 00 00 2d 01 10  61 61 61 20 20 20 20 20  |.....-..aaa     |
000c80a0  20 20 20 20 20 20 20 20  20 20 20 20 25 11 0a 04  |            %...|

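Next, the part in blue below is c1=1. It is a signed INT, so it is stored with the sign bit flipped, as "80 00 00 01"; c1=-10, for example, becomes "7f ff ff f6". The brown part that follows (00 00 00 00 0d 04) is DB_TRX_ID, an InnoDB-specific system column commonly called the "transaction ID". The orange part after that (80 00 00 00 2d 01 10) is DB_ROLL_PTR, another InnoDB-specific system column, commonly called the "roll pointer".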
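The two values quoted here (1 as 80 00 00 01 and -10 as 7f ff ff f6) are consistent with a 4-byte big-endian integer whose sign bit is flipped. A minimal sketch of that encoding, with helper names of my own choosing:

```python
import struct


def encode_int(value: int) -> bytes:
    """4-byte big-endian two's complement with the sign bit flipped."""
    unsigned = (value & 0xFFFFFFFF) ^ 0x80000000
    return struct.pack(">I", unsigned)


def decode_int(raw: bytes) -> int:
    """Inverse of encode_int."""
    (unsigned,) = struct.unpack(">I", raw)
    unsigned ^= 0x80000000
    # Reinterpret as signed 32-bit.
    return unsigned - (1 << 32) if unsigned & 0x80000000 else unsigned


print(encode_int(1).hex(" "))    # 80 00 00 01
print(encode_int(-10).hex(" "))  # 7f ff ff f6
print(decode_int(bytes.fromhex("80 00 00 64")))  # 100
```

A side effect of flipping the sign bit is that the stored bytes sort in the same order as the signed values under a plain byte-wise comparison.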

000c8070  08 03 00 00 73 75 70 72  65 6d 75 6d 00 25 11 0a  |....supremum.%..|
000c8080  04 00 00 10 09 00 b6 80  00 00 01 00 00 00 00 0d  |...............|
000c8090  04 80 00 00 00 2d 01 10  61 61 61 20 20 20 20 20  |.....-..aaa     |
000c80a0  20 20 20 20 20 20 20 20  20 20 20 20 25 11 0a 04  |            %...|

This table has a primary key, so the system column named DB_ROW_ID (commonly called the RowID) is not used. If you CREATE a table without a primary key, the RowID shows up in the blue position above.

#Which is to say, InnoDB has no concept of an Oracle-style RowID.

#How InnoDB identifies and places records, and the transaction ID and roll pointer, are topics for another day lol.

Now then, the preamble ran long as usual, but what I wanted to see this time is what the Record Header looks like with row_format=redundant.

The Record Header is the part in red below (the ten bytes 25 11 0a 04 00 00 10 09 00 b6, immediately before 80 00 00 01).

000c8070  08 03 00 00 73 75 70 72  65 6d 75 6d 00 25 11 0a  |....supremum.%..|
000c8080  04 00 00 10 09 00 b6 80  00 00 01 00 00 00 00 0d  |...............|
000c8090  04 80 00 00 00 2d 01 10  61 61 61 20 20 20 20 20  |.....-..aaa     |
000c80a0  20 20 20 20 20 20 20 20  20 20 20 20 25 11 0a 04  |            %...|

An InnoDB Record Header is read backwards, starting from the byte just before the origin.

b6 00 09 10 00 00 04 0a 11 25

Starting with the easy part:

04 0a 11 25

Converting these from hexadecimal to decimal gives

4, 10, 17, 37

and rewriting them as the length of each field makes it even clearer:

+4, +6, +7, +20

These are the end offsets of each field, written in the order c1 → transaction ID → roll pointer → c2.
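The step from end offsets to field lengths can be sketched like this. The four bytes are the ones from the dump, in storage order; the field names and the masking of the top bit as the SQL-null flag follow the old-style comment quoted earlier, and everything else (variable names, layout of the result) is just my illustration.

```python
# Offset bytes as they appear in the dump (storage order, last field first).
offset_bytes = bytes.fromhex("25 11 0a 04")
names = ["c1", "DB_TRX_ID", "DB_ROLL_PTR", "c2"]

fields = []
previous_end = 0
# Reading backwards from the origin gives the first field's end offset first.
for name, raw in zip(names, reversed(offset_bytes)):
    is_null = bool(raw & 0x80)  # top bit: SQL-null flag (1-byte format)
    end = raw & 0x7F            # remaining 7 bits: end offset from the origin
    fields.append((name, end, end - previous_end, is_null))
    previous_end = end

for name, end, length, is_null in fields:
    print(f"{name}: ends at {end}, length {length}, null={is_null}")

print([length for _, _, length, _ in fields])  # [4, 6, 7, 20]
```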

That leaves the following:

b6 00 09 10 00 00

It is 6 bytes long, so we can guess that this is "old style". If the source comments are right (they are occasionally wrong, so you can't be too sure), a "new style" header should be 5 bytes.

So let's work out what each bit means, assuming this is "old style". First, in binary. (Note: the actual bit order differs from what is shown here.)

1011 0110, 0000 0000, 0000 1001, 0001 0000, 0000 0000, 0000 0000

The first 16 bits (1011 0110, 0000 0000) are the next record pointer. Essentially an offset within the page.

1011 0110, 0000 0000, 0000 1001, 0001 0000, 0000 0000, 0000 0000

So the page offset is 0x00b6 = 182.
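As a cross-check of that pointer, the sketch below follows the next-record pointers through the page. The origin offsets and 6-byte headers in the table were read off the hexdump above by hand (page start 0xc8000), and treating infimum and supremum as having the same 6-byte header is my assumption; the helper names are made up.

```python
# origin offset within the page -> (label, 6 header bytes in storage order)
records = {
    0x065: ("infimum", "01 00 00 03 01 43"),
    0x143: ("c1=-10", "00 00 30 09 00 87"),
    0x087: ("c1=1", "00 00 10 09 00 b6"),
    0x0B6: ("c1=2", "00 00 18 09 01 14"),
    0x114: ("c1=30", "00 00 28 09 00 e5"),
    0x0E5: ("c1=100", "00 00 20 09 00 74"),
    0x074: ("supremum", "06 00 08 03 00 00"),
}


def next_pointer(header_hex: str) -> int:
    """The two bytes just before the origin, read as a big-endian page offset."""
    header = bytes.fromhex(header_hex)
    return int.from_bytes(header[-2:], "big")


def walk(start: int) -> list:
    """Follow next-record pointers until one leaves the known records."""
    order = []
    offset = start
    while offset in records:
        label, header_hex = records[offset]
        order.append(label)
        offset = next_pointer(header_hex)
    return order


print(next_pointer(records[0x087][1]))  # 182
print(walk(0x065))
# ['infimum', 'c1=-10', 'c1=1', 'c1=2', 'c1=30', 'c1=100', 'supremum']
```

Offset 182 is 0xb6, i.e. 000c80b6 in the dump, where 80 00 00 02 (c1=2) begins. The walk visits the rows in key order rather than insertion order.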

Next is a 1-bit flag.

1011 0110, 0000 0000, 0000 1001, 0001 0000, 0000 0000, 0000 0000

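This flag says whether the per-column end offsets described above, which are relative to the start of the record, are written in 1 byte (1) or 2 bytes (0). It is the least significant bit of 0000 1001; the offsets here are 1 byte each, so it is 1.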

After that comes a 10-bit field.

1011 0110, 0000 0000, 0000 1001, 0001 0000, 0000 0000, 0000 0000

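This 10-bit field gives the number of columns in this record (table). It is 0000000100 (the low three bits of 0001 0000 followed by the high seven bits of 0000 1001), so 4: c1, TransactionID, RollPointer, and c2. Incidentally, because a 10-bit field is used for the count, an InnoDB record cannot have more than 1023 (2^10 - 1) fields.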

Next is a 13-bit field.

1011 0110, 0000 0000, 0000 1001, 0001 0000, 0000 0000, 0000 0000

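It is 0000000000010 (0000 0000 followed by the high five bits of 0001 0000), so the value is 2. Apparently this is the record's number in the heap area of the index page. It's fine to leave its meaning unclear for now.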

After that, a 4-bit field.

1011 0110, 0000 0000, 0000 1001, 0001 0000, 0000 0000, 0000 0000

These 4 bits mean "the number of records owned by this record", but all I could figure out is that the value seems to change when an update or the like happens. I suppose this one has to wait for the control-flow analysis.

The last 4-bit field.

1011 0110, 0000 0000, 0000 1001, 0001 0000, 0000 0000, 0000 0000

According to the comment, these bits are used to delete-mark a record.
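Putting the whole walk-through together, here is a small sketch that decodes the 6 header bytes in one go. It treats the bytes in storage order as one 48-bit big-endian number and slices it using the field widths from the old-style comment (4 + 4 + 13 + 10 + 1 + 16 bits). The function name and the dictionary keys are mine, not InnoDB's.

```python
def decode_old_header(header: bytes) -> dict:
    """Decode a 6-byte old-style record header given in storage order."""
    if len(header) != 6:
        raise ValueError("an old-style header is exactly 6 bytes")
    value = int.from_bytes(header, "big")
    return {
        "info_bits": (value >> 44) & 0xF,      # 4 bits, delete mark etc.
        "n_owned": (value >> 40) & 0xF,        # 4 bits
        "heap_no": (value >> 27) & 0x1FFF,     # 13 bits
        "n_fields": (value >> 17) & 0x3FF,     # 10 bits
        "short_flag": (value >> 16) & 0x1,     # 1 bit, 1 = one-byte offsets
        "next_record": value & 0xFFFF,         # 16 bits, absolute page offset
    }


# The post lists the bytes backwards from the origin; flip them to storage order.
listed = bytes.fromhex("b6 00 09 10 00 00")
header = bytes(reversed(listed))
print(decode_old_header(header))
# {'info_bits': 0, 'n_owned': 0, 'heap_no': 2, 'n_fields': 4,
#  'short_flag': 1, 'next_record': 182}
```

The result matches the values worked out by hand above: next record at 182, one-byte offsets, 4 fields, heap number 2, and zeros for n_owned and the info bits.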

My understanding is still shallow, but the source I was reading along the way also strongly hints at redundant = old style and compact = new style, so I think I have the big picture right.

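And with that, it's about time to leave for work, so that's all for today. I'm heading straight to a client site today, so I have to leave early (tears).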