Variable length arrays part2
This document introduces an approach
to Variable Length Arrays (VLA) which builds
on "Fixed header variable
tail" described in the previous document. In this
approach there is no limitation on
the number of VLAs an enclosing layout may have.
This document proposes the idea of
multiple VLAs in layouts to gauge whether there
is interest in supporting more
complicated types of VLAs. It also discusses the new
issues that arise as more
capabilities are introduced.
There are certain types of data structures
that are difficult to express in C-like
languages, examples are a Java Class
File (JCF) or Protobuf types (varints,
packed
repeated fields). These types of
structures are common in real-world applications
and many people have written
solutions using Java, C, and other languages that
could be greatly simplified with the
approach proposed in this document. A layout
solution would allow people to use
LDL to describe a structure and interop with it
directly from Java instead of
writing additional code (native or Java) to parse it.
It would let developers quickly
prototype with these data layouts while leveraging
the performance capabilities of the
JVM for a production environment.
The following paragraphs will
demonstrate why the approach discussed in the
previous document (fixed header
variable tail) is insufficient for supporting a
JCF. It will also identify the
features required to make this possible. This
exercise will provide a good indication
of the capabilities required for VLAs to
support complex variable sized
layouts.
The previous document described an
approach "fixed header repeating tail" that is
only suitable for the most basic
types of variable sized structures. It cannot
describe complex variable structures
like a JCF. One of the limitations of "fixed
header repeating tail" is that
it only permits a single VLA in the layout and the
VLA must be at the end. A JCF has
many VLAs, as seen below, and as a result a few
of them are in the middle. Using the
basic approach, one would have to create
multiple Layouts and manually bind
them in order to represent the JCF. This may
impose additional security checks at
each layout bind operation. Also, it would
complicate the code as the user
would not have one layout that represents the JCF
but multiple. A layout solution
would greatly simplify this as one would only
have to define the structure (like
below) using LDL. For Layouts to support
structures like a JCF they must be
able to support multiple VLAs in a layout.
//c like representation
struct JavaClassFile
{
jint magicNumber;
...
jshort constantPoolCount;
CPInfo constantPool[constantPoolCount - 1];//VLA, more on this later
....
jshort interfacesCount;
jshort
interfaces[interfacesCount];//VLA
jshort fieldsCount;
Field fields[fieldsCount];//VLA
jshort methodsCount;
Method methods[methodsCount];//VLA
jshort attributesCount;
Attribute attributes[attributesCount];//VLA
}
The previous document specified that
the element type of a VLA cannot be a VLA
(or a layout containing a VLA). This
limitation restricts the types of structures
that can be described as it is
sometimes necessary to have VLAs nested in VLAs.
For example, the JCF contains a VLA
of Field. Field itself contains a VLA of
Attribute which also contains a VLA
of jbyte. To get around this limitation, a
user would need to write code that
walks each element of the VLA and binds a new
layout to that location (similar to
the previous case). Layouts will have to
support VLAs nested in VLAs in order
to completely describe complex structures.
struct JavaClassFile
{
...
jshort fieldsCount;
Field fields[fieldsCount];//VLA
...
}
struct Field {
jshort accessFlags;
...
jshort attributesCount;
Attributes
attributeInfo[attributesCount];//VLA
}
struct Attributes {
jshort attrNameIndex;
jint attrLength;
...
jbyte info[attrLength];//VLA
}
Another limitation of the approach
described in the previous document is that it
can only describe a VLA where the
length is the value in the repeat count field.
In some cases
the actual length of the VLA is the value of the repeat count field
with an operation applied to it. For
example, the JCF has a field "constantPool"
which is a VLA of CPInfo where the length is "constantPoolCount
- 1". This needs
to be addressed for layouts to
support complex variable structures.
struct JavaClassFile
{
...
jshort constantPoolCount;
CPInfo constantPool[constantPoolCount - 1];//VLA
...
}
There are some situations where the
length of the VLA may depend on multiple fields.
For example, the length of a UTF
constant pool entry is determined by the first byte
(indicates it is a UTF entry) and
the 2nd and 3rd bytes which indicate the length
of the UTF.
The major features that layouts must
support in order to describe complex variable
structures such as the JCF are:
- multiple VLAs in a layout
- VLAs nested in VLAs
- VLAs where length is determined by
an operation applied to a field (or fields)
This document focuses only on
"multiple VLAs in a layout" as the next step in
being able to describe complex types
of structures. The issues uncovered here
will shed light on how to best
specify the other features and will help to gauge
the interest in supporting complex
variable length structures. On its own, the
proposal in this document is still
insufficient to support a Layout representation
of the Java Class File format.
To support multiple VLAs in a
layout, some of the restrictions described in the
first document must be relaxed. Here
are the new rules:
- The repeat count field must be an
integral type (byte, short, int, long, char)
- The repeat count field cannot be
nested in another member
- The repeat count field must
precede the VLA
- VLA must be the last
field in the layout
-
- There can only be
one VLA in the enclosing layout
-
- A layout containing a VLA cannot
be nested in another layout
- The enclosing layout of a VLA is a
Variable sized layout
- The element type of a VLA (or
regular array) cannot be a Variable sized
layout or a VLA
- A VLA cannot be declared on its
own. It must be defined within a layout.
Example 1: Multiple VLAs in a layout
"A", //enclosing
layout
"w:jint:4", //repeat count field for x
"x:jbyte[w]:0", //VLA with size determined by
field 'w'
"y:jint:4", //repeat count field
for z
"z:jbyte[y]:0"; //VLA with size determined by
field 'y'
Multiple VLAs in a layout introduce
new concerns about writable repeat count
fields as it makes it possible to
introduce gaps in a layout (shrinking the VLA)
or overwrite fields that follow
(expanding the VLA). This was not a problem with
"fixed header repeating
tail" because there could only be one VLA in a layout and
it had to be at the end. This is a
problem when there are multiple VLAs because
changing the size of a field in the
middle of a layout affects all fields that
follow, and there could be many.
When memory is aliased gaps in a
layout will cause inconsistencies between views
of the same data. One solution is to
shift the data that follows the VLA when the
size is reduced so that gaps are
never created. But this could be impractical if
a Layout is very large. Also, all
fields following the VLA effectively get a new
offset which means any caching of
offsets for performance optimizations would be
invalidated. In addition, this also
assumes that it is possible to detect all
writes to the repeat count fields
and that race conditions will not occur.
A second solution is to simply allow
the repeat count field to change and make
users aware of the risks if they
decide to modify it.
A third solution (our current
preference) is to make the repeat count field
immutable provided that there are
bind operations (ie. bindLocation(Location loc,
long ... repeatCountInitializer))that make it possible to initialize the field.
The drawback is that it may not be
possible to make a field immutable as there are
many ways to alias data which could
circumvent any protections we put on Layout
operations that write memory.
In the previous case, "fixed
header repeating tail", the access time for every
field in a layout is equivalent. But
when more VLAs are added, fields no longer
have the same access time. The more VLAs are added the greater the disparity in
field access time. Fields near the
bottom of a layout may take longer to access
than fields near the top. Users need
to be aware of this behaviour as the costs
might be a surprise if Layouts
containing VLAs are nested in other Layouts.
Most users tend to access data
either in sequential or random order. Sequential
access is suitable for Layouts with
more than one VLA because in these cases
accessing a field requires one to
access all the repeat counters that precede it.
Accessing fields sequentially could
be easily optimized such that the repeat count
fields are read only once. Whereas,
accessing fields in random order would be more
difficult to optimize. If the JVM
can recognize the difference between the two
patterns there might be things it
can do to provide reasonable performance with
both access patterns.