Sunday, 26 September 2010

The Java Class File Format

Introduction

Compiled binary executables for different platforms usually differ not only in the instruction set, libraries, and APIs at which they are aimed, but also by the file format which is used to represent the program code. For instance, Windows uses the COFF file format, while Linux uses the ELF file format. Because Java aims at binary compatibility, it needs a universal file format for its programs - the Class File format.
The class file consists of some data of fixed length and lots of data of variable length and quantity, often nested inside other data of variable length. Therefore it is generally necessary to parse the whole file to read any one piece of data, because you will not know where that data is until you have worked your way through all the data before it. The JVM would just read the class file once and either store the data from the class file temporarily in a more easily accessed (but larger) format, or just remember where everything is in the class file. For this reason, a surprisingly large amount of the code of any JVM will be concerned with the interpretation, mapping, and possibly caching of this class file format.
Please note that this document is just an overview. The actual Class File format description is in the 'Java Virtual Machine Specification' which can be found online here, or in printed form here.

The Start

The Class file starts with the following bytes:
Length (number of bytes)Example
magic40xCAFEBABE
minor_version20x0003
major_version20x002D
The 'magic' bytes are always set to 0xCAFEBABE and are simply a way for the JVM to check that it has loaded a class file rather than some other set of bytes.
The version bytes identify the version of the Class File format which this file conforms to. Obviously a JVM would have trouble reading a class file format which was defined after that JVM was written. Each new version of the JVM specification generally says what range of Class File versions it should be able to process.

Constant Pool

This major_version is followed by the constant_pool_count (2 bytes), and the constant_pool table.
The constant_pool table consists of several entries which can be of various types, and therefore of variable length. There are constant_pool_count - 1 entries, and each entries is referred to by its 1-indexed position in the table. Therefore, the first item is referred to as Constant Pool item 1. An index into the Constant Pool table can be store in 2 bytes.
Each entry can be one of the following, and is identified by the tag byte at the start of the entry.
TagContents
CONSTANT_Class7The name of a class 
CONSTANT_Fieldref 9The name and type of a Field, and the class of which it is a member.
CONSTANT_Methodref 10The name and type of a Method, and the class of which it is a member.
CONSTANT_InterfaceMethodref 11The name and type of a Interface Method, and the Interface of which it is a member.
CONSTANT_String8The index of a CONSTANT_Utf8 entry.
CONSTANT_Integer34 bytes representing a Java integer.
CONSTANT_Float 44 bytes representing a Java float.
CONSTANT_Long58 bytes representing a Java long.
CONSTANT_Double68 bytes representing a Java double.
CONSTANT_NameAndType12The Name and Type entry for a field, method, or interface.
CONSTANT_Utf812 bytes for the length, then a string in Utf8 (Unicode) format.
Note that the primitive types, such as CONSTANT_Integer, are stored in big-endian format, with the most significant bits first. This is the most obvious and intuitive way of storing values, but some processors (in particular, Intel x86 processors) use values in little-endian format, so the JVM may need to manipulate these bytes to get the data into the correct form.
Many of these entries refer to other entries, but they generally end up referring to one or more Utf8 entries.
For instance, here is are the levels of containment for a CONSTANT_Fieldref entry:
CONSTANT_Fieldref
    index to a CONSTANT_Class entry
        index to a CONSTANT_Utf8 entry
    index to a CONSTANT_NameAndType entry
        index to a CONSTANT_Utf8 entry (name)
        index to a CONSTANT_Utf8 entry (type descriptor)
Note that simple text names are used to identify entities such as classes, fields, and methods. This greatly simplifies the task of linking them together both externally and internally.

The Middle Bit

access_flags (2 bytes)

This shows provide information about the class, by ORing the following flags together:
  • ACC_PUBLIC
  • ACC_FINAL
  • ACC_SUPER
  • ACC_INTERFACE
  • ACC_ABSTRACT

this_class

These 2 bytes are an index to a CONSTANT_Class entry in the constant_pool, which should provide the name of this class.

super_class

Like this_class, but provides the name of the class's parent class. Remember that Java only has single-inheritance, so there can not be more than one immediate base class.

Interfaces

2 bytes for the interfaces_count, and then a table of CONSTANT_InterfaceRef indexes, showing which interfaces this class 'implements' or 'extends'..

Fields

After the interfaces table, there are 2 bytes for the fields_count, followed by a table of field_info tables.
Each field_info table contains the following information:
Length (number of bytes)Description
access_flags2e.g. ACC_PUBLIC, ACC_PRIVATE, etc
name_index2Index of a CONSTANT_Utf8
descriptor_index2Index of a CONSTANT_Utf8 (see type descriptors)
attributes_count2
attributesvariese.g. Constant Value. (see attributes)

Methods

 After the fields table, there are 2 bytes for the methods_count, followed by a table of method_info tables. This has the same entries as the field_info table, with the following differences:
  • The access_flags are slightly different.
  • The descriptor has a slightly different format (see type descriptors)
  • A different set attributes are included in the attributes table - most importantly the 'code' attribute which contains the Java bytecode for the method. (see attributes)

Type Descriptors

Field and Method types are represented, in a special notation, by a string. This notation is described below.

Primitive Types

Primitive types are represented by one of the following characters:
byteB
charC
doubleD
floatF
intI
longJ
shortS
booleanZ
For instance, an integer field would have a descriptor of "I".

Classes

Classes are indicated by an 'L', followed by the path to the class name, then a semi-colon to mark the end of the class name.
For instance, a String field would have a descriptor of "Ljava/lang/String;"

Arrays

Arrays are indicated with a '[' character.
For instance an array of Integers would have a descriptor of "[I".
Multi-dimensional arrays simply have extra '[' characters. For instance, "[[I".

Field Descriptors

A field has just one type, described in a string in the above notation. e.g. "I", or "Ljava/lang/String".

Method Descriptors

Because methods involve several types - the arguments and the return type - their type descriptor notation is slightly different. The argument types are at the start of the string inside brackets, concatenated together. Note that the type descriptors are concatenated without any separator character. The return type is after the closing bracket.
For instance, "int someMethod(long lValue, boolean bRefresh);" would have a descriptor of "(JZ)I". 

Attributes

Both the field_info table and the method_info table include a list of attributes. Each attribute starts with the index of a CONSTANT_Utf8 (2 bytes) and then the length of the following data (4 bytes). The structure of the following data depends on the particular attribute type. This allows new or custom attributes to be included in the class file without disrupting the existing structure, and without requiring recognition in the JVM specification. Any unrecognised attribute types will simply be ignored.
Attributes can contain sub-attributes. For instance, the code attribute can contain a LineNumberTable attribut
Here are some possible attributes:
CodeDetails, including bytecode, of a method's code.
ConstantValueUsed by 'final' fields
ExceptionsExceptions thrown by a method.
InnerClassesA class's inner classes.
LineNumberTableDebugging information
LocalVariableTableDebugging information.
SourceFileSource file name.
SyntheticShows that the field or method was generated by the compiler.

Code attribute

The Code attribute is used by the method_info table. It is where you will find the actual bytecodes (opcodes an operands) of the method's classes.
The attributes has the following structure:
Length (number of bytes)Description
max_stack2Size of stack required by the method's code.
max_locals2Number of local variables required by the method's code.
code_length2
codecode_lengthThe method's executable bytecodes
exception_table_length2
exception_tablevariesThe exceptions which the method can throw.
attributes_count2
attributesvariese.g. LineNumberTable
Each exception table entry has the following structure, each describing one exception catch:
Length (number of bytes)Description
start_pc2Offset of start of try/catch range.
end_pc2Offset of end of try/catch range.
handler_pc2Offset of start of exception handler code.
catch_type2Type of exception handled.
These entries for the Code attribute will probably only make sense to you if you are familiar with the rest of the JVM specification.
source: http://www.murrayc.com/learning/java/java_classfileformat.shtml

No comments:

Post a Comment