Variable length arrays part2

1) Intro

This document introduces an approach to Variable Length Arrays (VLA) which builds

on "Fixed header variable tail" described in the previous document. In this

approach there is no limitation on the number of VLAs an enclosing layout may have.

This document proposes the idea of multiple VLAs in layouts to gauge whether there

is interest in supporting more complicated types of VLAs. It also discusses the new

issues that arise as more capabilities are introduced.

2) Motivation

There are certain types of data structures that are difficult to express in C-like

languages, examples are a Java Class File (JCF) or Protobuf types (varints, packed

repeated fields). These types of structures are common in real-world applications

and many people have written solutions using Java, C, and other languages that

could be greatly simplified with the approach proposed in this document. A layout

solution would allow people to use LDL to describe a structure and interop with it

directly from Java instead of writing additional code (native or Java) to parse it.

It would let developers quickly prototype with these data layouts while leveraging

the performance capabilities of the JVM for a production environment.

The following paragraphs will demonstrate why the approach discussed in the

previous document (fixed header variable tail) is insufficient for supporting a

JCF. It will also identify the features required to make this possible. This

exercise will provide a good indication of the capabilities required for VLAs to

support complex variable sized layouts.

The previous document described an approach "fixed header repeating tail" that is

only suitable for the most basic types of variable sized structures. It cannot

describe complex variable structures like a JCF. One of the limitations of "fixed

header repeating tail" is that it only permits a single VLA in the layout and the

VLA must be at the end. A JCF has many VLAs, as seen below, and as a result a few

of them are in the middle. Using the basic approach, one would have to create

multiple Layouts and manually bind them in order to represent the JCF. This may

impose additional security checks at each layout bind operation. Also, it would

complicate the code as the user would not have one layout that represents the JCF

but multiple. A layout solution would greatly simplify this as one would only

have to define the structure (like below) using LDL. For Layouts to support

structures like a JCF they must be able to support multiple VLAs in a layout.

//c like representation

struct JavaClassFile {

jint magicNumber;

...

jshort constantPoolCount;

CPInfo constantPool[constantPoolCount - 1];//VLA, more on this later

....

jshort interfacesCount;

jshort interfaces[interfacesCount];//VLA

jshort fieldsCount;

Field fields[fieldsCount];//VLA

jshort methodsCount;

Method methods[methodsCount];//VLA

jshort attributesCount;

Attribute attributes[attributesCount];//VLA

}

The previous document specified that the element type of a VLA cannot be a VLA

(or a layout containing a VLA). This limitation restricts the types of structures

that can be described as it is sometimes necessary to have VLAs nested in VLAs.

For example, the JCF contains a VLA of Field. Field itself contains a VLA of

Attribute which also contains a VLA of jbyte. To get around this limitation, a

user would need to write code that walks each element of the VLA and binds a new

layout to that location (similar to the previous case). Layouts will have to

support VLAs nested in VLAs in order to completely describe complex structures.

struct JavaClassFile {

...

jshort fieldsCount;

Field fields[fieldsCount];//VLA

...

}

struct Field {

jshort accessFlags;

...

jshort attributesCount;

Attributes attributeInfo[attributesCount];//VLA

}

struct Attributes {

jshort attrNameIndex;

jint attrLength;

...

jbyte info[attrLength];//VLA

}

Another limitation of the approach described in the previous document is that it

can only describe a VLA where the length is the value in the repeat count field.

In some cases the actual length of the VLA is the value of the repeat count field

with an operation applied to it. For example, the JCF has a field "constantPool"

which is a VLA of CPInfo where the length is "constantPoolCount - 1". This needs

to be addressed for layouts to support complex variable structures.

struct JavaClassFile {

...

jshort constantPoolCount;

CPInfo constantPool[constantPoolCount - 1];//VLA

...

}

There are some situations where the length of the VLA may depend on multiple fields.

For example, the length of a UTF constant pool entry is determined by the first byte

(indicates it is a UTF entry) and the 2nd and 3rd bytes which indicate the length

of the UTF.

The major features that layouts must support in order to describe complex variable

structures such as the JCF are:

- multiple VLAs in a layout

- VLAs nested in VLAs

- VLAs where length is determined by an operation applied to a field (or fields)

This document focuses only on "multiple VLAs in a layout" as the next step in

being able to describe complex types of structures. The issues uncovered here

will shed light on how to best specify the other features and will help to gauge

the interest in supporting complex variable length structures. On its own, the

proposal in this document is still insufficient to support a Layout representation

of the Java Class File format.

3) Multiple VLAs in a Layout

To support multiple VLAs in a layout, some of the restrictions described in the

first document must be relaxed. Here are the new rules:

- The repeat count field must be an integral type (byte, short, int, long, char)

- The repeat count field cannot be nested in another member

- The repeat count field must precede the VLA

- ~~VLA must be the last field in the layout~~

- VLA can be anywhere after its repeat count field

- ~~There can only be one VLA in the enclosing layout~~

- There can be multiple VLAs in the enclosing layout

- A layout containing a VLA cannot be nested in another layout

- The enclosing layout of a VLA is a Variable sized layout

- The element type of a VLA (or regular array) cannot be a Variable sized

layout or a VLA

- A VLA cannot be declared on its own. It must be defined within a layout.

Example 1: Multiple VLAs in a layout

"A", //enclosing layout

"w:jint:4", //repeat count field for x

"x:jbyte[w]:0", //VLA with size determined by field 'w'

"y:jint:4", //repeat count field for z

"z:jbyte[y]:0"; //VLA with size determined by field 'y'

4) Concerns

4.1) Are the repeat count fields writable?

Multiple VLAs in a layout introduce new concerns about writable repeat count

fields as it makes it possible to introduce gaps in a layout (shrinking the VLA)

or overwrite fields that follow (expanding the VLA). This was not a problem with

"fixed header repeating tail" because there could only be one VLA in a layout and

it had to be at the end. This is a problem when there are multiple VLAs because

changing the size of a field in the middle of a layout affects all fields that

follow, and there could be many.

When memory is aliased gaps in a layout will cause inconsistencies between views

of the same data. One solution is to shift the data that follows the VLA when the

size is reduced so that gaps are never created. But this could be impractical if

a Layout is very large. Also, all fields following the VLA effectively get a new

offset which means any caching of offsets for performance optimizations would be

invalidated. In addition, this also assumes that it is possible to detect all

writes to the repeat count fields and that race conditions will not occur.

A second solution is to simply allow the repeat count field to change and make

users aware of the risks if they decide to modify it.

A third solution (our current preference) is to make the repeat count field

immutable provided that there are bind operations (ie. bindLocation(Location loc,

long ... repeatCountInitializer))that make it possible to initialize the field.

The drawback is that it may not be possible to make a field immutable as there are

many ways to alias data which could circumvent any protections we put on Layout

operations that write memory.

4.2) Access performance

In the previous case, "fixed header repeating tail", the access time for every

field in a layout is equivalent. But when more VLAs are added, fields no longer

have the same access time. The more VLAs are added the greater the disparity in

field access time. Fields near the bottom of a layout may take longer to access

than fields near the top. Users need to be aware of this behaviour as the costs

might be a surprise if Layouts containing VLAs are nested in other Layouts.

Most users tend to access data either in sequential or random order. Sequential

access is suitable for Layouts with more than one VLA because in these cases

accessing a field requires one to access all the repeat counters that precede it.

Accessing fields sequentially could be easily optimized such that the repeat count

fields are read only once. Whereas, accessing fields in random order would be more

difficult to optimize. If the JVM can recognize the difference between the two

patterns there might be things it can do to provide reasonable performance with

both access patterns.