This document somewhat more formally presents the JSON Pull Parser mentioned in two previous blog posts.
Purpose
As JSON has become more prevalent in Web applications, Java on the Web needs a JSON parser. The JDK does not bundle one: javax.json (JSONP) came from Java EE rather than Java SE, and while the Jakarta project continues to develop JSONP as jakarta.json, developers may want something simpler and more lightweight.
The StAX parser demonstrated that pull parsing provides a lightweight API that, unlike the “push parsing” model of SAX, keeps the application code in control without the memory overhead of DOM. For example, a JSON-RPC server can stream the outer envelope to glean the id and method elements, then delegate parsing of params to the code that performs the procedure. The only overhead is that the parser can’t use the Java call stack to track parsing state, and must save its state whenever the application regains control.
A Java application that marshals and unmarshals JSON as a simple communication protocol would benefit from a simple pull parser, unlike applications that primarily manipulate and transform JSON, which would benefit more from JSONP.[1] Therefore I decided to write a JSONPP library with the following goals:
- A simple, easy-to-use API that inverts the usual event-driven model.
- An efficient parsing model that uses JSON’s simplicity.
- Careful use of Java memory to avoid creating more garbage than needed.
- Future proofing by using Unicode code points natively.
Design
Interfaces
These are the three public interfaces in the parser:
- JsonEvent enumerates all events in a JSON document.
- JsonPullParser provides a (*cough*) stream of events at the user’s pace.
- JsonPullParserFactory creates a pull parser around an InputStream or Reader without the client knowing what specific class it’s using. It uses the well-known (and arguably overused) factory method pattern.
One would use it something like this:
    try {
        var parser = factory.createParser(reader);
        while (parser.hasNext()) {
            parser.next();
            switch (parser.getEvent()) {
                case START_ARRAY:
                    // and so on ...
            }
        }
    } finally {
        reader.close();
    }
Events happen in this order:
- START_STREAM
- One[2] of:
  - a JSON Array:
    - START_ARRAY
    - Zero or more values: String, Number, true, false, null, nested Array, or Object.
    - END_ARRAY
  - a JSON Object:
    - START_OBJECT
    - Zero or more key-value pairs:
      - the key: KEY_NAME
      - the value: a String, Number, true, false, null, Array, or nested Object.
    - END_OBJECT
  - a JSON String: VALUE_STRING
  - a JSON Number: VALUE_NUMBER
  - a JSON true: VALUE_TRUE
  - a JSON false: VALUE_FALSE
  - a JSON null: VALUE_NULL
- END_STREAM
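To make the ordering concrete, here is a minimal sketch that walks an already-built Java value (nested List, Map, String, Number, Boolean, or null) and lists the events a parser would report for the equivalent JSON text. The class EventOrder and its methods are hypothetical and exist only for illustration. They are not part of JSONPP, and event names are plain strings rather than JsonEvent constants so the example stands alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustration only: lists events in the order described above. */
class EventOrder {

    /** Returns the event names for one JSON document wrapping the given value. */
    static List<String> events(Object document) {
        List<String> out = new ArrayList<>();
        out.add("START_STREAM");
        emit(document, out);
        out.add("END_STREAM");
        return out;
    }

    private static void emit(Object value, List<String> out) {
        if (value instanceof List) {
            out.add("START_ARRAY");
            for (Object item : (List<?>) value) {
                emit(item, out);           // zero or more values
            }
            out.add("END_ARRAY");
        } else if (value instanceof Map) {
            out.add("START_OBJECT");
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                out.add("KEY_NAME");       // the key always precedes its value
                emit(entry.getValue(), out);
            }
            out.add("END_OBJECT");
        } else if (value instanceof String) {
            out.add("VALUE_STRING");
        } else if (value instanceof Number) {
            out.add("VALUE_NUMBER");
        } else if (Boolean.TRUE.equals(value)) {
            out.add("VALUE_TRUE");
        } else if (Boolean.FALSE.equals(value)) {
            out.add("VALUE_FALSE");
        } else if (value == null) {
            out.add("VALUE_NULL");
        } else {
            throw new IllegalArgumentException("not a JSON value: " + value.getClass());
        }
    }
}
```

Calling EventOrder.events(List.of(1, "x")) yields START_STREAM, START_ARRAY, VALUE_NUMBER, VALUE_STRING, END_ARRAY, END_STREAM, in that order.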
State Machine
Without realizing it, I was designing a finite-state machine.
As those states get a little confusing in Javadoc (before it’s javadoc-ed), this table summarizes the value of every method.
| Event | isInArray | isInObject | getString | getNumber | getCurrentKey |
|---|---|---|---|---|---|
| START_STREAM | F | F | - | - | - |
| START_ARRAY | T | F | - | - | - |
| END_ARRAY | ? | ? | - | - | (enclosing?) |
| START_OBJECT | F | T | - | - | - |
| END_OBJECT | ? | ? | - | - | (enclosing?) |
| KEY_NAME | F | T | key | - | key |
| VALUE_NULL | ? | ? | - | - | last |
| VALUE_TRUE | ? | ? | - | - | last |
| VALUE_FALSE | ? | ? | - | - | last |
| VALUE_NUMBER | ? | ? | number | number | last |
| VALUE_STRING | ? | ? | string | - | last |
| END_STREAM | F | F | - | - | - |
- isInObject and isInArray reflect whether the current object being parsed is a JSON Object or Array. Their state during any event depends on whether the most recent START_* event not matched by a corresponding END_* is an _OBJECT or _ARRAY.
- getString and getNumber provide the current JSON String or Number value parsed. getString also reflects the last JSON Object key parsed, and the string value of a Number.
- getCurrentKey tracks what key the parser is currently parsing, if any. Thus it reports on the last key seen, or, when a JSON Object or Array ends, the key that object should be assigned in the enclosing JSON Object.
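One way to get the behavior in the table is a stack of open containers. This is a sketch and makes no claim about how JSONPP does it internally. ParserState, its nested Event enum, and Frame are hypothetical names, and all scalar values are collapsed into a single VALUE event to keep the example short. It also assumes the events arrive in a well-formed order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustration only: tracks isInArray, isInObject, and getCurrentKey with a stack. */
class ParserState {
    enum Event { START_ARRAY, END_ARRAY, START_OBJECT, END_OBJECT, KEY_NAME, VALUE }

    /** One open container and, for objects, the last key seen inside it. */
    private static final class Frame {
        final boolean isObject;
        String key;
        Frame(boolean isObject) { this.isObject = isObject; }
    }

    private final Deque<Frame> stack = new ArrayDeque<>();
    private String currentKey;

    /** Updates the state for one event; text is the key for KEY_NAME, otherwise ignored. */
    void accept(Event event, String text) {
        switch (event) {
            case START_ARRAY:
                stack.push(new Frame(false));
                currentKey = null;
                break;
            case START_OBJECT:
                stack.push(new Frame(true));
                currentKey = null;
                break;
            case END_ARRAY:
            case END_OBJECT:
                stack.pop();                  // back to the enclosing container
                currentKey = enclosingKey();  // key this container belongs to, if any
                break;
            case KEY_NAME:
                Frame top = stack.peek();
                if (top == null || !top.isObject) {
                    throw new IllegalStateException("key outside an object");
                }
                top.key = text;
                currentKey = text;
                break;
            case VALUE:
                currentKey = enclosingKey();  // the "last" key, or null inside an array
                break;
        }
    }

    private String enclosingKey() {
        Frame top = stack.peek();
        return (top != null && top.isObject) ? top.key : null;
    }

    boolean isInArray() {
        Frame top = stack.peek();
        return top != null && !top.isObject;
    }

    boolean isInObject() {
        Frame top = stack.peek();
        return top != null && top.isObject;
    }

    String getCurrentKey() {
        return currentKey;
    }
}
```

The “?” entries in the table fall out of the stack: after an END_ARRAY or END_OBJECT, the answers come from whichever container is now on top, or are all false when the stack is empty.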
Implementation
In 2019 I cut a somewhat rickety version. Last month I cleaned it up a bit and posted it on GitHub. Anyone can pull from it (I think) but so far no one but me can push.
As of this writing, this is old code. I’m working on a few improvements, beyond the many, many items on my TODO list.
Learning By Coding
Two less public interfaces grew to prominence as coding went on:
- JsonLexer separates the “lexing” (breaking into “words”) step from the “parsing” (making sense of the words) step. It only has three methods (void next(), CharSequence getToken(), and int getTokenType()) and a bunch of ints representing tokens in the JSON spec. In retrospect this was overkill, but I think I wanted to do this project “right”. That proved elusive.
- CodePointSource started as future-proofing, since as far as I can tell Java has no consistent interface for reading Unicode code points from outside. (Some classes use IntStream, but only on internal representations with no risk that reading may fail.) As I kept working, though, I realized my CodePoint interface really did fill a niche, so I decided to make it its own library.
The Name Game
I went through several rounds of renaming, particularly CodePointSource. (In the Jan 29 versions I renamed it Source, but it’s back to the awkward-sounding CodePointSource.) I also separated CodePointSource and the inevitable CodePointSink into their own packages, and now into their own library.
The Pre-Fetching Problem
While testing my old code, I found that it often called getCodePoint() before calling next() to fetch a code point. Apparently I’d kludged the WriterSource to fetch the first code point (or at least char) upon creation, and DefaultJsonLexer (as I eventually called it) depended on that behavior.
Untangling that has taken a lot of my time, and I still need to fix one test that broke when I enforced calling those methods in the right order.
Unlike java.util.Iterator and its ilk, I separated advancing to the next character and fetching that character because, in part, I didn’t want to “push back” a character the lexer had read that formed part of the next token. Instead of re-pushing and popping characters, code can inspect a CodePointSource’s current state without changing it.
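The sketch below shows the advance/fetch split over a java.io.Reader, including the rule that next() must be called before getCodePoint(). It is a minimal illustration and not the library’s code: the class name ReaderCodePoints and the decode helper are made up, and the way it handles unpaired surrogates is an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.NoSuchElementException;
import java.util.stream.IntStream;

/** Illustration only: separates advancing (next) from fetching (getCodePoint). */
class ReaderCodePoints {
    private static final int EMPTY = -2;   // nothing buffered yet
    private final Reader reader;
    private int current = -1;              // set by the most recent next()
    private int lookahead = EMPTY;         // filled by hasNext(); -1 means end of input

    ReaderCodePoints(Reader reader) {
        this.reader = reader;
    }

    /** May read ahead one code point, but never changes the current one. */
    boolean hasNext() throws IOException {
        if (lookahead == EMPTY) {
            lookahead = readCodePoint();
        }
        return lookahead >= 0;
    }

    /** Makes the following code point current. */
    void next() throws IOException {
        if (!hasNext()) {
            throw new NoSuchElementException("end of input");
        }
        current = lookahead;
        lookahead = EMPTY;
    }

    /** Returns the current code point as often as asked, without consuming it. */
    int getCodePoint() {
        if (current < 0) {
            throw new IllegalStateException("call next() first");
        }
        return current;
    }

    /** Reads one char, or a surrogate pair combined into one code point. */
    private int readCodePoint() throws IOException {
        int hi = reader.read();
        if (hi < 0 || !Character.isHighSurrogate((char) hi)) {
            return hi;
        }
        int lo = reader.read();
        if (lo >= 0 && Character.isLowSurrogate((char) lo)) {
            return Character.toCodePoint((char) hi, (char) lo);
        }
        throw new IOException("unpaired surrogate in input");
    }

    /** Convenience for experiments: decodes a whole string into code points. */
    static int[] decode(String text) {
        ReaderCodePoints source = new ReaderCodePoints(new StringReader(text));
        IntStream.Builder out = IntStream.builder();
        try {
            while (source.hasNext()) {
                source.next();
                out.add(source.getCodePoint());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.build().toArray();
    }
}
```

Because getCodePoint() has no side effects, a lexer can hand the source to a helper method that looks at the current code point and decides whether to consume more, with no push-back buffer involved.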
Codepoint
As I mentioned earlier, the simple API to abstract out Unicode code points from the underlying source (Writers, InputStreams, CharBuffers, even a java.nio.ByteBuffer of ASCII, Latin-1, or UTF-8 bytes) took on a life of its own. As I planned other parser projects, I split it off into its own library called “CodePoint”.
A facade called CodePoint hides the exact classes used to wrap I/O classes and buffers. Instead, a caller provides the object and its class or interface to wrap, and CodePoint uses something like[3] the java.util.ServiceLoader to load a wrapper for the provided class.
Thus clients really need to know only the CodePoint class and the two interfaces, and implementors can load new implementations with a configuration file in their library’s jar. Ideally. I’m still working on it.
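As a rough picture of the idea, the sketch below keeps a table from “class or interface to wrap” to “wrapper class” and instantiates the wrapper through a one-argument constructor. WrapperRegistry and ReaderWrapper are invented for this example and are not CodePoint’s API. The real library would presumably fill the table from a configuration file rather than from explicit register calls.

```java
import java.io.Reader;
import java.io.StringReader;
import java.lang.reflect.Constructor;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a CodePointSource implementation that wraps a Reader. */
class ReaderWrapper {
    final Reader reader;

    public ReaderWrapper(Reader reader) {
        this.reader = reader;
    }
}

/** Illustration only: picks a wrapper class by the type of the object to wrap. */
class WrapperRegistry {
    // Wrapped class or interface mapped to the wrapper class that handles it.
    private final Map<Class<?>, Class<?>> wrappers = new LinkedHashMap<>();

    void register(Class<?> wrapped, Class<?> wrapper) {
        wrappers.put(wrapped, wrapper);
    }

    /** Creates a fresh wrapper for the input; nothing is cached between calls. */
    Object wrap(Object input) {
        for (Map.Entry<Class<?>, Class<?>> entry : wrappers.entrySet()) {
            Class<?> wrapped = entry.getKey();
            if (wrapped.isInstance(input)) {
                try {
                    // Look up the public constructor taking the registered type.
                    Constructor<?> ctor = entry.getValue().getConstructor(wrapped);
                    return ctor.newInstance(input);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("cannot create " + entry.getValue(), e);
                }
            }
        }
        throw new IllegalArgumentException("no wrapper for " + input.getClass());
    }
}
```

With Reader.class registered against ReaderWrapper.class, wrap(new StringReader("[]")) returns a new ReaderWrapper, because StringReader is an instance of the registered superclass.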
JSON Push Producer
Flushed with success,[4] I decided to design a JSON emitter that took the parser API and reversed it to create the JSON Push Producer. (JSONPP? Get it?)
However, after writing test code like this:
_producer.setEvent(JsonEvent.START_ARRAY);
_producer.push();
_producer.setEvent(JsonEvent.END_ARRAY);
_producer.push();
… just to produce an empty array [], I now think I’ll steal from JSONP consciously this time, specifically the Json*Builders from javax.json, and add writeTo(Writer) and writeTo(OutputStream, Charset) methods. I may pick some other alliterative appellation like “Basic Builder”.
Open Issues
The JSONPP TODO.md and CodePoint TODO.md list all the major and minor issues of which I’m aware. To highlight a few not already mentioned:
- Both projects need a working Ant file that builds outside any IDE. It shouldn’t be hard; I just haven’t had the time yet.
- JSONPP needs an architecture, preferably reflection-based, that will turn key-value properties into configuration for every instance. For example, are two or more values permitted on the same stream? The spec says no, but many implementations stream multiple requests or responses on the same socket connection.
- JSONPP (and maybe CodePoint) need not only more aggressive error detection but more precise error reporting, e.g. line and column, or at least a “JSON path”, for illegal JSON.
- Internationalization would be nice. Right now all exception messages are hardcoded (English).
- Actual numbers backing up the “efficient use of memory” are a must-have.
- Reimplementing parts of JSONP with JSONPP would be nice, although maybe at this point a bit ambitious.
- CodePoint should be able to read and write ByteBuffers, and maybe CharBuffers and even IntBuffers. I just don’t know NIO that well yet.
- I also don’t know Java 17 and Java 19 (and even Java 11) as well as I should. Maybe they make CodePoint wholly obsolete?
- However the CodePoint architecture turns out, I need to document it, and ideally auto-generate any configuration files (e.g. a list of CodePointSource and CodePointSink implementations).
- Do much, much, much more testing, ultimately including JSONP’s Technology Compatibility Kit (TCK).
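For the error-reporting item, line and column tracking could be as small as the sketch below. LineColumnTracker is a hypothetical helper, not something in either project. It assumes columns count code points and that "\r", "\n", and "\r\n" each end one line.

```java
/** Illustration only: tracks the 1-based line and column of the last code point read. */
class LineColumnTracker {
    private int line = 1;
    private int column = 0;          // 0 until the first code point on a line is read
    private boolean lastWasCr = false;

    /** Call once for every code point the lexer consumes, in order. */
    void advance(int codePoint) {
        if (codePoint == '\r' || (codePoint == '\n' && !lastWasCr)) {
            line++;                  // a new line starts after this terminator
            column = 0;
        } else if (codePoint != '\n') {
            column++;                // ordinary code point on the current line
        }
        // A '\n' directly after '\r' falls through both branches: the pair is one terminator.
        lastWasCr = (codePoint == '\r');
    }

    int line() {
        return line;
    }

    int column() {
        return column;
    }
}
```

A lexer would call advance for each code point it consumes and, on a syntax error, put line() and column() into the exception message.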
API
Event
package com.frank_mitchell.jsonpp;
/**
* Enumeration of all possible JsonPullParser events.
*/
public enum JsonEvent {
/**
* Invalid JSON syntax.
*/
SYNTAX_ERROR,
/**
* Before first JSON element
*/
START_STREAM,
/**
* Start of JSON array ('[')
*/
START_ARRAY,
/**
* End of JSON array (']')
*/
END_ARRAY,
/**
* Start of JSON object ('{')
*/
START_OBJECT,
/**
* End of JSON object ('}')
*/
END_OBJECT,
/**
* Key of JSON object member ('"'...'"' ':')
*/
KEY_NAME,
/**
* JSON null ("null")
*/
VALUE_NULL,
/**
* JSON boolean true ("true")
*/
VALUE_TRUE,
/**
* JSON boolean false ("false")
*/
VALUE_FALSE,
/**
* JSON number
*/
VALUE_NUMBER,
/**
* JSON string ("...")
*/
VALUE_STRING,
/**
* After last JSON element
*/
END_STREAM
};
Pull Parser
package com.frank_mitchell.jsonpp;
import java.io.Closeable;
import java.io.IOException;
/**
* This interface traverses a JSON Value as a stream of events.
*
* Each call to {@link #next()} moves to the next event in the stream, and the
* various "get" methods identify the type of event, the value of a String or
* Number, and/or the name of a key String.
*
* Implementations of this interface aren't guaranteed to be thread safe. In
* most cases one thread will parse an input stream and then discard this
* parser. In some cases one thread <strong>might</strong> hand a parser off to
* another thread, then continue parsing once that thread has finished. (In the
* latter case a co-routine or cooperative single-threaded framework might be
* more efficient.)
*
* @author Frank Mitchell
*
*/
public interface JsonPullParser extends Closeable {
/**
* Get the event parsed by the most recent call to {@link #next()}.
*
* @return most recently parsed event.
*/
public JsonEvent getEvent();
/**
* Indicates if the enclosing value is a JSON Array.
*
* If this object is currently processing the contents of a JSON Array, this
* method will return {@code true}.
*
* @return {@code true} if the enclosing value is a JSON Array.
*
* @see #isInObject()
*/
public boolean isInArray();
/**
* Indicates if the enclosing value is a JSON Object.
*
* If this parser is currently processing the contents of a JSON Object,
* this method will return {@code true}. If neither this method nor
* {@link #isInArray()} are true, this parser is either at the start or end
* of the document, the document contains only an atomic value, or the
* parser encountered an error.
*
* @return {@code true} if the enclosing value is a JSON Object.
*/
public boolean isInObject();
/**
* Whether this implementation supports {@link #getCurrentKey}.
* While most should, some implementers may choose memory footprint
* and speed over convenience.
* Override this method for implementations that don't.
*
* @return whether {@link #getCurrentKey()} is supported.
*/
default public boolean isCurrentKeySupported() {
return true;
}
/**
* Gets the key associated with the current value.
*
* On {@link JsonEvent#KEY_NAME}, the result is the JSON Object key
* with outer quotes removed and backslash escapes resolved.
*
* On {@link JsonEvent#END_OBJECT},
* {@link JsonEvent#END_ARRAY},
* {@link JsonEvent#VALUE_STRING},
* {@link JsonEvent#VALUE_NUMBER},
* {@link JsonEvent#VALUE_TRUE},
* {@link JsonEvent#VALUE_FALSE}, or {@link JsonEvent#VALUE_NULL}, the
* result is the JSON Object key this value should be assigned to, if the
* enclosing construct is a JSON Object.
*
* On {@link JsonEvent#START_STREAM},
* {@link JsonEvent#START_ARRAY},
* {@link JsonEvent#START_OBJECT},
* {@link JsonEvent#END_STREAM}, or {@link JsonEvent#SYNTAX_ERROR} or if
* there is no immediately enclosing JSON object, this method returns
* {@code null};
*
* @return the key associated with the current value, or {@code null}
* @throws UnsupportedOperationException if method not supported.
*/
public String getCurrentKey();
/**
* Gets the string value associated with the current event.
*
* On {@link JsonEvent#KEY_NAME}, the result is the JSON Object key with all
* escape sequences converted to their character values.
*
* On {@link JsonEvent#VALUE_STRING}, the result is the JSON String with all
* escape sequences converted to their character values.
*
* On a {@link JsonEvent#VALUE_NUMBER}, the result is the number as
* originally read.
*
* Otherwise the method throws an exception.
*
* @return the value of a key, String, or Number
*
* @throws IllegalStateException if the current event has no string value.
*/
public String getString();
/**
* Gets the {@link Number} value associated with the current event.
*
* If {@link #getEvent()} is {@link JsonEvent#VALUE_NUMBER}, this method
* returns an unspecified subclass of Number. Otherwise this method throws
* an exception.
*
* @return the value of the current JSON Number
*
* @throws IllegalStateException if the current event is not a number.
*/
public Number getNumber() throws IllegalStateException;
/**
* Gets the {@code double} value associated with the current event.
*
* If {@link #getEvent()} is {@link JsonEvent#VALUE_NUMBER}, this method
* returns that number as a {@code double}. Otherwise this method throws
* an exception.
*
* @return the value of the current JSON Number
*
* @throws IllegalStateException if the current event is not a number.
*/
default double getDouble() throws IllegalStateException {
Number n = getNumber();
if (n == null) {
return Double.NaN;
} else {
return n.doubleValue();
}
}
/**
* Gets the {@code int} value associated with the current event.
*
* If {@link #getEvent()} is {@link JsonEvent#VALUE_NUMBER}, this method
* returns that number as an {@code int}. Otherwise this method throws
* an exception.
*
* @return the value of the current JSON Number
*
* @throws IllegalStateException if the current event is not a number.
*/
default int getInt() throws IllegalStateException {
Number n = getNumber();
if (n == null) {
throw new IllegalStateException("!" + JsonEvent.VALUE_NUMBER);
}
return n.intValue();
}
/**
* Gets the {@code long} value associated with the current event.
*
* If {@link #getEvent()} is {@link JsonEvent#VALUE_NUMBER}, this method
* returns that number as a {@code long}. Otherwise this method throws
* an exception.
*
* @return the value of the current JSON Number
*
* @throws IllegalStateException if the current event is not a number.
*/
default public long getLong() throws IllegalStateException {
Number n = getNumber();
if (n == null) {
throw new IllegalStateException("!" + JsonEvent.VALUE_NUMBER);
}
return n.longValue();
}
/**
* Advances to the next significant JSON element in the underlying stream.
*
* @throws IOException if the character source could not be read.
*/
public void next() throws IOException;
/**
* Equivalent to calling {@link #next()} followed by {@link #getEvent()}.
*
* @return most recently parsed event.
*
* @throws IOException if the character source could not be read
*/
default public JsonEvent nextEvent() throws IOException {
next();
return getEvent();
}
/**
* Close the underlying IO or NIO object.
*
* @throws IOException from the underlying object.
*/
@Override
void close() throws IOException;
}
Pull Parser Factory
package com.frank_mitchell.jsonpp;
import com.frank_mitchell.codepoint.CodePointSource;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.Map;
/**
* Creates JsonPullParser instances without clients knowing the specific
* class(es) used.
*
* {@link #setConfiguration(Map)} provides a hook to configure a factory without
* knowing or caring what specific instance performs the work.
*/
public interface JsonPullParserFactory {
/**
* Create a parser to read {@code char}s.
*
* @param reader a stream of UTF-16 characters to parse
* @return new parser
* @throws IOException if source throws an IOException
*/
JsonPullParser createParser(Reader reader) throws IOException;
/**
* Create a parser to process an ASCII or UTF-8 stream.
*
* @param input a stream of ASCII or UTF-8 bytes
* @return new parser
* @throws IOException if source throws an IOException
*/
JsonPullParser createUtf8Parser(InputStream input) throws IOException;
/**
* Create a parser to process an encoded byte stream.
*
* @param input a stream of encoded bytes to parse
* @param enc the standard name for the stream's encoding
* @return new parser
* @throws IOException if source throws an IOException
*/
JsonPullParser createParser(InputStream input, Charset enc) throws IOException;
/**
* Create a parser to process a stream of Unicode code points.
*
* @param source provider of code points
* @return new parser
* @throws IOException if source throws an IOException
*/
JsonPullParser createParser(CodePointSource source) throws IOException;
}
Code Point Source
package com.frank_mitchell.codepoint;
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
/**
* An iterator over an external sequence of Unicode code points.
* Using {@code int} instead of {@code char} is a bit of
* future-proofing for when streams commonly contain characters
* outside of the Basic Multilingual Plane (0x0000 - 0xFFFF).
* Implementers can transparently decode UTF-8 or UTF-16 multi-byte
* characters into a single code point. (At least until Unicode expands
* past 32 bits.)
*
* Unlike standard Java {@link Iterator}s, advancing the iterator and
* reading the next item in the sequence can be two separate actions.
* That way one can pass the source to other methods and they can read
* the last code point read without altering state.
*
* @author Frank Mitchell
*/
public interface CodePointSource extends Closeable {
/**
* Read the current code point after the last call to {@link #next()}.
*
* @return current code point.
*/
int getCodePoint();
/**
* Whether this source still has code points remaining.
* This method may read ahead to the next character, which may
* cause an exception.
*
* @return whether this object has a next code point.
*
* @throws java.io.IOException if read-ahead throws an exception
*/
boolean hasNext() throws IOException;
/**
* Advance to the next code point; read it with {@link #getCodePoint()}.
*
* @throws IOException if the underlying source could not be read.
*/
void next() throws IOException;
/**
* Close the underlying IO or NIO object.
*
* @throws IOException from the underlying object.
*/
@Override
void close() throws IOException;
}
MIT License
Copyright 2019, 2023 Frank Mitchell
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
1. Unbeknownst to me when I started this project, JSONP’s javax.json.stream implements a pull parser too.
2. An optional extension is to allow multiple values on, for example, half a network socket. END_STREAM would signal the closing of that half of the socket.
3. ServiceLoader automatically instantiates a “service” class with a zero-argument constructor and caches it for further use. codepoint, on the other hand, instantiates a wrapper class with a constructor for the object’s class, superclass, or implemented interface and an optional java.nio.charset.Charset, then throws it away when reading is done. I thought about changing the protocol to a zero-argument constructor followed by a method to set the current input, but not only is that hard with generics, it requires wrappers to reset their state if that method’s called again. I decided to go with the usual Java convention of using an input instance once, closing it, then throwing it away.
4. Metaphorically. I doubt my skin ever loses its pallor.