Java Rumblings: The Java Class File Format

Sunday, 26 September 2010

The Java Class File Format

Introduction

Compiled binary executables for different platforms usually differ not only in the instruction set, libraries, and APIs at which they are aimed, but also in the file format which is used to represent the program code. For instance, Windows uses the PE file format (which is based on COFF), while Linux uses the ELF file format. Because Java aims at binary compatibility, it needs a universal file format for its programs - the Class File format.
The class file consists of some data of fixed length and lots of data of variable length and quantity, often nested inside other data of variable length. Therefore it is generally necessary to parse the whole file to read any one piece of data, because you will not know where that data is until you have worked your way through all the data before it. A JVM typically reads the class file once and either stores the data from the class file temporarily in a more easily accessed (but larger) format, or just remembers where everything is in the class file. For this reason, a surprisingly large amount of the code of any JVM will be concerned with the interpretation, mapping, and possibly caching of this class file format.
Please note that this document is just an overview. The actual Class File format description is in the 'Java Virtual Machine Specification', which is available both online and in printed form.

The Start

The Class file starts with the following bytes:
Field          Length (bytes)  Example
magic          4               0xCAFEBABE
minor_version  2               0x0003
major_version  2               0x002D
The 'magic' bytes are always set to 0xCAFEBABE and are simply a way for the JVM to check that it has loaded a class file rather than some other set of bytes.
The version bytes identify the version of the Class File format which this file conforms to. Obviously a JVM would have trouble reading a class file format which was defined after that JVM was written. Each new version of the JVM specification generally says what range of Class File versions it should be able to process.
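As a minimal sketch of the check described above, the following reads the first 8 bytes of a class file and rejects anything that does not start with the magic number. The class name ClassFileHeader and its fields are hypothetical names chosen for this illustration; java.nio.ByteBuffer is used because it reads big-endian values by default.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of any JVM.
final class ClassFileHeader {
    final int minorVersion;
    final int majorVersion;

    private ClassFileHeader(int minorVersion, int majorVersion) {
        this.minorVersion = minorVersion;
        this.majorVersion = majorVersion;
    }

    /** Reads magic, minor_version and major_version from the start of the given bytes. */
    static ClassFileHeader read(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        int magic = buf.getInt();                     // 4 bytes
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException(
                "Not a class file: magic was 0x" + Integer.toHexString(magic));
        }
        int minor = buf.getShort() & 0xFFFF;          // 2 bytes, treated as unsigned
        int major = buf.getShort() & 0xFFFF;          // 2 bytes, treated as unsigned
        return new ClassFileHeader(minor, major);
    }
}
```

With the example bytes from the table, read(...) would report a minor version of 3 and a major version of 45 (0x2D).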

Constant Pool

This major_version is followed by the constant_pool_count (2 bytes), and the constant_pool table.
The constant_pool table consists of several entries which can be of various types, and therefore of variable length. Valid indexes run from 1 to constant_pool_count - 1, and each entry is referred to by its 1-indexed position in the table. Therefore, the first item is referred to as Constant Pool item 1. Note that a CONSTANT_Long or CONSTANT_Double entry takes up two index positions, so the index after it is unused. An index into the Constant Pool table can be stored in 2 bytes.
Each entry can be one of the following, and is identified by the tag byte at the start of the entry.
Entry type                   Tag  Contents
CONSTANT_Class               7    The name of a class (the index of a CONSTANT_Utf8 entry).
CONSTANT_Fieldref            9    The name and type of a Field, and the class of which it is a member.
CONSTANT_Methodref           10   The name and type of a Method, and the class of which it is a member.
CONSTANT_InterfaceMethodref  11   The name and type of an Interface Method, and the Interface of which it is a member.
CONSTANT_String              8    The index of a CONSTANT_Utf8 entry.
CONSTANT_Integer             3    4 bytes representing a Java integer.
CONSTANT_Float               4    4 bytes representing a Java float.
CONSTANT_Long                5    8 bytes representing a Java long.
CONSTANT_Double              6    8 bytes representing a Java double.
CONSTANT_NameAndType         12   The name and type descriptor of a field or method (two CONSTANT_Utf8 indexes).
CONSTANT_Utf8                1    2 bytes for the length in bytes, then a string in a slightly modified UTF-8 encoding.
Note that the primitive types, such as CONSTANT_Integer, are stored in big-endian format, with the most significant bits first. This is the most obvious and intuitive way of storing values, but some processors (in particular, Intel x86 processors) use values in little-endian format, so the JVM may need to manipulate these bytes to get the data into the correct form.
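To make the byte order concrete, here is a small sketch that assembles big-endian values by hand, most significant byte first. The class and method names are hypothetical; in practice java.io.DataInputStream or java.nio.ByteBuffer already do this for you.

```java
// Hypothetical helper for illustration only.
final class BigEndian {
    /** Assembles a 4-byte big-endian value starting at offset. */
    static int readInt(byte[] data, int offset) {
        // Mask each byte with 0xFF so that negative bytes do not sign-extend.
        return ((data[offset] & 0xFF) << 24)
             | ((data[offset + 1] & 0xFF) << 16)
             | ((data[offset + 2] & 0xFF) << 8)
             |  (data[offset + 3] & 0xFF);
    }

    /** Assembles a 2-byte unsigned big-endian value, as used for constant pool indexes. */
    static int readUnsignedShort(byte[] data, int offset) {
        return ((data[offset] & 0xFF) << 8) | (data[offset + 1] & 0xFF);
    }
}
```

For example, the bytes 0x00 0x00 0x01 0x00 decode to 256, because the third byte carries a weight of 256.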
Many of these entries refer to other entries, but they generally end up referring to one or more Utf8 entries.
For instance, here are the levels of containment for a CONSTANT_Fieldref entry:
CONSTANT_Fieldref
    index to a CONSTANT_Class entry
        index to a CONSTANT_Utf8 entry
    index to a CONSTANT_NameAndType entry
        index to a CONSTANT_Utf8 entry (name)
        index to a CONSTANT_Utf8 entry (type descriptor)
Note that simple text names are used to identify entities such as classes, fields, and methods. This greatly simplifies the task of linking them together both externally and internally.
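The chain of indexes above can be followed with a small reader. This is only a sketch under some assumptions: the class ConstantPool and its methods are hypothetical names, only the tags listed in this document are handled (anything else throws), and Utf8 entries are decoded as standard UTF-8, which differs from the class file's modified encoding for a few unusual characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical reader for illustration; handles only the tags listed above.
final class ConstantPool {
    // entries[i] is the decoded entry at 1-based index i; index 0 is unused.
    private final Object[] entries;

    private ConstantPool(Object[] entries) { this.entries = entries; }

    /** Reads constant_pool_count and the table; buf must be positioned just after major_version. */
    static ConstantPool read(ByteBuffer buf) {
        int count = buf.getShort() & 0xFFFF;
        Object[] entries = new Object[count];
        for (int i = 1; i < count; i++) {
            int tag = buf.get() & 0xFF;
            switch (tag) {
                case 1: { // CONSTANT_Utf8: 2-byte length, then the bytes
                    byte[] raw = new byte[buf.getShort() & 0xFFFF];
                    buf.get(raw);
                    entries[i] = new String(raw, StandardCharsets.UTF_8); // simplification
                    break;
                }
                case 3: entries[i] = buf.getInt(); break;
                case 4: entries[i] = buf.getFloat(); break;
                case 5: entries[i] = buf.getLong(); i++; break;   // occupies two index slots
                case 6: entries[i] = buf.getDouble(); i++; break; // occupies two index slots
                case 7: case 8: // Class, String: one index
                    entries[i] = new int[] { buf.getShort() & 0xFFFF };
                    break;
                case 9: case 10: case 11: case 12: // the refs and NameAndType: two indexes
                    entries[i] = new int[] { buf.getShort() & 0xFFFF, buf.getShort() & 0xFFFF };
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported constant pool tag " + tag);
            }
        }
        return new ConstantPool(entries);
    }

    String utf8(int index) { return (String) entries[index]; }

    /** Follows CONSTANT_Class -> CONSTANT_Utf8 to get the class name. */
    String className(int classIndex) { return utf8(((int[]) entries[classIndex])[0]); }

    /** Follows a CONSTANT_Fieldref down to its three Utf8 strings. */
    String describeFieldref(int index) {
        int[] ref = (int[]) entries[index];          // {class index, NameAndType index}
        int[] nameAndType = (int[]) entries[ref[1]]; // {name index, descriptor index}
        return className(ref[0]) + "." + utf8(nameAndType[0]) + ":" + utf8(nameAndType[1]);
    }
}
```

A real JVM would validate every index and tag before following it; this sketch trusts the input in order to keep the containment chain visible.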

The Middle Bit

access_flags (2 bytes)
This provides information about the class, by ORing the following flags together:
ACC_PUBLIC
ACC_FINAL
ACC_SUPER
ACC_INTERFACE
ACC_ABSTRACT
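As an illustration of how the ORed flags can be decoded, the sketch below tests each bit in turn. The numeric values are those given in the JVM specification for class access flags; the class name ClassAccessFlags and the method names(...) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical decoder for illustration; covers only the five flags listed above.
final class ClassAccessFlags {
    static final int ACC_PUBLIC    = 0x0001;
    static final int ACC_FINAL     = 0x0010;
    static final int ACC_SUPER     = 0x0020;
    static final int ACC_INTERFACE = 0x0200;
    static final int ACC_ABSTRACT  = 0x0400;

    private static final int[] BITS =
        { ACC_PUBLIC, ACC_FINAL, ACC_SUPER, ACC_INTERFACE, ACC_ABSTRACT };
    private static final String[] NAMES =
        { "ACC_PUBLIC", "ACC_FINAL", "ACC_SUPER", "ACC_INTERFACE", "ACC_ABSTRACT" };

    /** Returns the names of the flags that are set in the 2-byte access_flags value. */
    static List<String> names(int accessFlags) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < BITS.length; i++) {
            if ((accessFlags & BITS[i]) != 0) {
                result.add(NAMES[i]);
            }
        }
        return result;
    }
}
```

For example, an access_flags value of 0x0021 is ACC_PUBLIC ORed with ACC_SUPER, which is what a compiler typically emits for an ordinary public class.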
this_class
These 2 bytes are an index to a CONSTANT_Class entry in the constant_pool, which should provide the name of this class.
super_class
Like this_class, but provides the name of the class's parent class. Remember that Java only has single-inheritance, so there cannot be more than one immediate base class.
Interfaces
2 bytes for the interfaces_count, and then a table of indexes to CONSTANT_Class entries (2 bytes each), showing which interfaces this class 'implements' or, if it is itself an interface, 'extends'.

Fields

After the interfaces table, there are 2 bytes for the fields_count, followed by a table of field_info tables.
Each field_info table contains the following information:
Field             Length (bytes)  Description
access_flags      2               e.g. ACC_PUBLIC, ACC_PRIVATE, etc.
name_index        2               Index of a CONSTANT_Utf8
descriptor_index  2               Index of a CONSTANT_Utf8 (see type descriptors)
attributes_count  2               Number of entries in the attributes table
attributes        varies          e.g. ConstantValue (see attributes)

Methods

After the fields table, there are 2 bytes for the methods_count, followed by a table of method_info tables. This has the same entries as the field_info table, with the following differences:
The access_flags are slightly different.
The descriptor has a slightly different format (see type descriptors)
A different set of attributes is included in the attributes table - most importantly the 'Code' attribute which contains the Java bytecode for the method. (see attributes)
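Because field_info and method_info share the same layout, one reader can walk either table. The sketch below is an assumption-laden illustration: MemberInfo and readTable are hypothetical names, and every attribute is skipped using its length (2 bytes of name index followed by a 4-byte length) rather than interpreted.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for illustration; attribute contents are skipped, not decoded.
final class MemberInfo {
    final int accessFlags;
    final int nameIndex;
    final int descriptorIndex;
    final int attributesCount;

    private MemberInfo(int accessFlags, int nameIndex, int descriptorIndex, int attributesCount) {
        this.accessFlags = accessFlags;
        this.nameIndex = nameIndex;
        this.descriptorIndex = descriptorIndex;
        this.attributesCount = attributesCount;
    }

    /** Reads fields_count (or methods_count) and the table that follows it. */
    static List<MemberInfo> readTable(ByteBuffer buf) {
        int count = buf.getShort() & 0xFFFF;
        List<MemberInfo> members = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int access = buf.getShort() & 0xFFFF;
            int name = buf.getShort() & 0xFFFF;
            int descriptor = buf.getShort() & 0xFFFF;
            int attributes = buf.getShort() & 0xFFFF;
            for (int a = 0; a < attributes; a++) {
                buf.getShort();                        // attribute name index, ignored here
                int length = buf.getInt();             // assumes the length fits in a signed int
                buf.position(buf.position() + length); // jump over the attribute data
            }
            members.add(new MemberInfo(access, name, descriptor, attributes));
        }
        return members;
    }
}
```

Calling readTable twice in a row, once positioned at fields_count and then again at methods_count, would walk both tables, since the buffer is left just past the table that was read.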

Type Descriptors

Field and Method types are represented, in a special notation, by a string. This notation is described below.
Primitive Types
Primitive types are represented by one of the following characters:
byte B
char C
double D
float F
int I
long J
short S
boolean Z
For instance, an integer field would have a descriptor of "I".
Classes
Classes are indicated by an 'L', followed by the path to the class name, then a semi-colon to mark the end of the class name.
For instance, a String field would have a descriptor of "Ljava/lang/String;"
Arrays
Arrays are indicated with a leading '[' character, followed by the descriptor of the element type.
For instance an array of ints would have a descriptor of "[I".
Multi-dimensional arrays simply have extra '[' characters. For instance, "[[I".
Field Descriptors
A field has just one type, described in a string in the above notation. e.g. "I", or "Ljava/lang/String;".
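The notation above can be turned back into a source-style type name with a few lines of code. This is a minimal sketch: FieldDescriptor and toJavaType are hypothetical names, and the input is assumed to be a well-formed field descriptor (no validation of the trailing ';' is attempted).

```java
// Hypothetical decoder for illustration; assumes a well-formed field descriptor.
final class FieldDescriptor {
    /** Converts e.g. "[[I" to "int[][]" and "Ljava/lang/String;" to "java.lang.String". */
    static String toJavaType(String descriptor) {
        int dims = 0;
        while (descriptor.charAt(dims) == '[') {
            dims++; // each leading '[' adds one array dimension
        }
        String element = descriptor.substring(dims);
        String base;
        switch (element.charAt(0)) {
            case 'B': base = "byte"; break;
            case 'C': base = "char"; break;
            case 'D': base = "double"; break;
            case 'F': base = "float"; break;
            case 'I': base = "int"; break;
            case 'J': base = "long"; break;
            case 'S': base = "short"; break;
            case 'Z': base = "boolean"; break;
            case 'L':
                // Strip the leading 'L' and the trailing ';', then use dots as in source code.
                base = element.substring(1, element.length() - 1).replace('/', '.');
                break;
            default:
                throw new IllegalArgumentException("Bad descriptor: " + descriptor);
        }
        StringBuilder result = new StringBuilder(base);
        for (int i = 0; i < dims; i++) {
            result.append("[]");
        }
        return result.toString();
    }
}
```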
Method Descriptors
Because methods involve several types - the arguments and the return type - their type descriptor notation is slightly different. The argument types are at the start of the string inside parentheses, concatenated together. Note that the type descriptors are concatenated without any separator character. The return type is after the closing parenthesis; a void return type is written as "V".
For instance, "int someMethod(long lValue, boolean bRefresh);" would have a descriptor of "(JZ)I".
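Since the argument descriptors are concatenated without separators, a parser has to work out where each one ends: a primitive is one character, a class type runs to its ';', and any leading '[' characters belong to the type that follows. The sketch below splits a method descriptor on that basis. MethodDescriptor and parse are hypothetical names, and the input is assumed to be well formed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for illustration; assumes a well-formed method descriptor.
final class MethodDescriptor {
    final List<String> parameterTypes; // raw descriptors, e.g. "J", "[[I"
    final String returnType;           // raw descriptor, e.g. "I" or "V"

    private MethodDescriptor(List<String> parameterTypes, String returnType) {
        this.parameterTypes = parameterTypes;
        this.returnType = returnType;
    }

    static MethodDescriptor parse(String descriptor) {
        if (descriptor.charAt(0) != '(') {
            throw new IllegalArgumentException("Not a method descriptor: " + descriptor);
        }
        List<String> params = new ArrayList<>();
        int pos = 1;
        while (descriptor.charAt(pos) != ')') {
            int start = pos;
            while (descriptor.charAt(pos) == '[') {
                pos++;                              // array dimensions
            }
            if (descriptor.charAt(pos) == 'L') {
                pos = descriptor.indexOf(';', pos); // class name runs up to the ';'
            }
            pos++;                                  // consume the type character or the ';'
            params.add(descriptor.substring(start, pos));
        }
        return new MethodDescriptor(params, descriptor.substring(pos + 1));
    }
}
```

Parsing "(JZ)I" this way yields the parameter descriptors "J" and "Z" and the return descriptor "I", matching the someMethod example.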

Attributes

Both the field_info table and the method_info table include a list of attributes. Each attribute starts with the index of a CONSTANT_Utf8 (2 bytes) and then the length of the following data (4 bytes). The structure of the following data depends on the particular attribute type. This allows new or custom attributes to be included in the class file without disrupting the existing structure, and without requiring recognition in the JVM specification. Any unrecognised attribute types will simply be ignored.
Attributes can contain sub-attributes. For instance, the Code attribute can contain a LineNumberTable attribute.
Here are some possible attributes:
Attribute           Description
Code                Details, including bytecode, of a method's code.
ConstantValue       Used by 'final' fields.
Exceptions          Exceptions thrown by a method.
InnerClasses        A class's inner classes.
LineNumberTable     Debugging information.
LocalVariableTable  Debugging information.
SourceFile          Source file name.
Synthetic           Shows that the field or method was generated by the compiler.
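The "name index, then length, then data" layout is what lets a reader skip attributes it does not understand. The following sketch keeps the raw bytes of recognised attributes and jumps over the rest. AttributeReader, readAttributes and the utf8ByIndex array (standing in for a lookup of CONSTANT_Utf8 entries by constant pool index) are hypothetical names invented for this example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical reader for illustration; recognises only the names listed above.
final class AttributeReader {
    private static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
        "Code", "ConstantValue", "Exceptions", "InnerClasses",
        "LineNumberTable", "LocalVariableTable", "SourceFile", "Synthetic"));

    /**
     * Reads attributes_count and the attributes that follow it.
     * utf8ByIndex stands in for the constant pool: utf8ByIndex[i] is the Utf8 string at index i.
     */
    static Map<String, byte[]> readAttributes(ByteBuffer buf, String[] utf8ByIndex) {
        int count = buf.getShort() & 0xFFFF;
        Map<String, byte[]> found = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            String name = utf8ByIndex[buf.getShort() & 0xFFFF]; // 2-byte name index
            int length = buf.getInt();                          // 4-byte data length
            if (KNOWN.contains(name)) {
                byte[] body = new byte[length];
                buf.get(body);
                found.put(name, body);
            } else {
                buf.position(buf.position() + length);          // unrecognised: skip it
            }
        }
        return found;
    }
}
```

Using a map keyed by name is a simplification, because some attributes may legitimately appear more than once in a list; a fuller reader would keep a list of entries instead.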
Code attribute
The Code attribute is used by the method_info table. It is where you will find the actual bytecodes (opcodes and operands) of the class's methods.
The attribute has the following structure:
Length (number of bytes) Description
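Field                   Length (bytes)  Description
max_stack               2               Size of stack required by the method's code.
max_locals              2               Number of local variables required by the method's code.
code_length             4               Number of bytes in the code array.
code                    code_length     The method's executable bytecodes.
exception_table_length  2               Number of entries in the exception table.
exception_table         varies          The exception handlers (try/catch ranges) in the method's code.
attributes_count        2               Number of entries in the attributes table.
attributes              varies          e.g. LineNumberTable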
Each exception table entry has the following structure, each describing one exception catch:
Field       Length (bytes)  Description
start_pc    2               Offset of start of try/catch range.
end_pc      2               Offset of end of try/catch range (exclusive).
handler_pc  2               Offset of start of exception handler code.
catch_type  2               Index of a CONSTANT_Class for the type of exception handled, or 0 to handle any exception.
These entries for the Code attribute will probably only make sense to you if you are familiar with the rest of the JVM specification.
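As a rough sketch of how these fields fit together, the code below parses the body of a Code attribute (the bytes after the attribute's name index and length) and looks up which handler covers a given bytecode offset. CodeAttribute, parse and firstHandlerFor are hypothetical names. Per the JVM specification, code_length is read as a 4-byte value and end_pc is treated as exclusive. The lookup ignores catch_type, so it only shows the range check, not full exception matching.

```java
import java.nio.ByteBuffer;

// Hypothetical parser for illustration; nested attributes are not read.
final class CodeAttribute {
    final int maxStack;
    final int maxLocals;
    final byte[] code;
    final int[][] exceptionTable; // each row: {start_pc, end_pc, handler_pc, catch_type}

    private CodeAttribute(int maxStack, int maxLocals, byte[] code, int[][] exceptionTable) {
        this.maxStack = maxStack;
        this.maxLocals = maxLocals;
        this.code = code;
        this.exceptionTable = exceptionTable;
    }

    /** Parses the attribute body, i.e. the bytes after the name index and length. */
    static CodeAttribute parse(byte[] body) {
        ByteBuffer buf = ByteBuffer.wrap(body);
        int maxStack = buf.getShort() & 0xFFFF;
        int maxLocals = buf.getShort() & 0xFFFF;
        byte[] code = new byte[buf.getInt()];         // code_length is 4 bytes
        buf.get(code);
        int[][] table = new int[buf.getShort() & 0xFFFF][4];
        for (int[] row : table) {
            for (int column = 0; column < 4; column++) {
                row[column] = buf.getShort() & 0xFFFF;
            }
        }
        return new CodeAttribute(maxStack, maxLocals, code, table);
    }

    /** Returns handler_pc of the first entry whose range covers pc, or -1 if there is none. */
    int firstHandlerFor(int pc) {
        for (int[] row : exceptionTable) {
            if (pc >= row[0] && pc < row[1]) {        // start_pc inclusive, end_pc exclusive
                return row[2];
            }
        }
        return -1;
    }
}
```

Entries are checked in table order, which matters when try blocks are nested: the compiler is expected to place the innermost handler first.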

source: http://www.murrayc.com/learning/java/java_classfileformat.shtml
