JSONPP: JSON Pull Parser (Work in Progress)

Frank Mitchell

Posted: 2023-02-15
Last Modified: 2023-04-13
Word Count: 3141
Tags: java json programming

This document somewhat more formally presents the JSON Pull Parser mentioned in two previous blog posts.

Purpose

As JSON has become more prevalent in Web applications, Java on the Web needs a JSON parser. JDK 11 removed javax.json and JSONP, and while the Jakarta project continues to develop JSONP as jakarta.json, developers may want something simpler and more lightweight.

The StAX parser demonstrated that pull parsing provides a lightweight API that, unlike the “push parsing” model of SAX, keeps the application code in control without the memory overhead of DOM. For example, a JSON-RPC server can stream the outer envelope to glean the id and method elements, then delegate parsing of params to the code that performs the procedure. The only overhead is that the parser can’t use the Java call stack to track parsing state, and must save its state whenever the application regains control.

A Java application that marshals and unmarshals JSON as a simple communication protocol would benefit from a simple pull parser, unlike applications that primarily manipulate and transform JSON, which would benefit more from JSONP.[1] Therefore I decided to write a JSONPP library with the following goals:

  1. A simple, easy-to-use API that inverts the usual event-driven model.
  2. An efficient parsing model that uses JSON’s simplicity.
  3. Careful use of Java memory to avoid creating more garbage than needed.
  4. Future proofing by using Unicode code points natively.
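
To illustrate goal 4, here is a minimal sketch (not part of JSONPP) that uses only standard java.lang.String methods to show how counting by char and counting by code point diverge once text leaves the Basic Multilingual Plane. The class and method names are made up for this example.

```java
public class CodePointDemo {
    /** Counts UTF-16 code units; a supplementary character counts twice. */
    static int countChars(String s) {
        return s.length();
    }

    /** Counts Unicode code points; every character counts once. */
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Returns the code point at a code point index (not a char index). */
    static int codePointAtIndex(String s, int index) {
        return s.codePointAt(s.offsetByCodePoints(0, index));
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(countChars(s));       // 4
        System.out.println(countCodePoints(s));  // 3
        System.out.println(Integer.toHexString(codePointAtIndex(s, 1))); // 1f600
    }
}
```

A parser that works on int code points sees three characters here, so it never has to reason about half of a surrogate pair.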

Design

Interfaces

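These are the three public types in the parser, each listed in full under API below: the JsonEvent enumeration and the JsonPullParser and JsonPullParserFactory interfaces.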

One would use it something like this:

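// try-with-resources closes the parser, which closes the underlying reader
try (var parser = factory.createParser(reader)) {
    JsonEvent event;
    do {
        event = parser.nextEvent();   // next() followed by getEvent()
        switch (event) {
            case START_ARRAY:
                // and so on ...
                break;
            default:
                break;
        }
    } while (event != JsonEvent.END_STREAM
            && event != JsonEvent.SYNTAX_ERROR);
}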

Events happen in this order:

  1. START_STREAM
  2. One[2] of:
    • a JSON Array:
      1. START_ARRAY
      2. Zero or more values:
        • String, Number, Boolean, null, nested Array, or Object.
      3. END_ARRAY
    • a JSON Object:
      1. START_OBJECT
      2. Zero or more key-value pairs:
        1. the key: KEY_NAME
        2. the value: a String, Number, Boolean, null, Array, or nested Object.
      3. END_OBJECT
    • a JSON String: VALUE_STRING
    • a JSON Number: VALUE_NUMBER
    • a JSON Boolean true: VALUE_TRUE
    • a JSON Boolean false: VALUE_FALSE
    • a JSON null: VALUE_NULL
  3. END_STREAM
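
The ordering above can be checked mechanically. The following is a rough sketch, not code from JSONPP: it declares its own stand-in Event enum and a hypothetical isWellFormed method that tracks the enclosing arrays and objects on a stack and rejects any sequence that breaks the rules in the list. It does not cover the multiple-value extension mentioned in the footnote.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class EventOrderCheck {
    /** Local stand-in for the JsonEvent enum listed under API. */
    enum Event {
        START_STREAM, START_ARRAY, END_ARRAY, START_OBJECT, END_OBJECT,
        KEY_NAME, VALUE_NULL, VALUE_TRUE, VALUE_FALSE, VALUE_NUMBER,
        VALUE_STRING, END_STREAM
    }

    /** Returns true if the events follow the order described in the list. */
    static boolean isWellFormed(List<Event> events) {
        int n = events.size();
        if (n < 2 || events.get(0) != Event.START_STREAM
                || events.get(n - 1) != Event.END_STREAM) {
            return false;
        }
        Deque<Event> open = new ArrayDeque<>(); // enclosing START_ARRAY / START_OBJECT
        boolean expectKey = false;    // in an object: next is KEY_NAME or END_OBJECT
        boolean topValueSeen = false; // the single top-level value has ended
        for (Event e : events.subList(1, n - 1)) {
            Event top = open.peek();  // null at the top level
            switch (e) {
                case START_STREAM:
                case END_STREAM:
                    return false;     // only allowed at the two ends
                case KEY_NAME:
                    if (top != Event.START_OBJECT || !expectKey) return false;
                    expectKey = false;
                    continue;         // a key is not a complete value
                case END_ARRAY:
                    if (top != Event.START_ARRAY) return false;
                    open.pop();
                    break;
                case END_OBJECT:
                    if (top != Event.START_OBJECT || !expectKey) return false;
                    open.pop();
                    break;
                default: {
                    // a scalar value or the start of a nested container
                    boolean allowed = (top == null) ? !topValueSeen
                            : (top == Event.START_ARRAY || !expectKey);
                    if (!allowed) return false;
                    if (e == Event.START_ARRAY || e == Event.START_OBJECT) {
                        open.push(e);
                        expectKey = (e == Event.START_OBJECT);
                        continue;     // the container is not complete yet
                    }
                }
            }
            // A complete value just ended: update the enclosing context.
            if (open.isEmpty()) {
                topValueSeen = true;
            } else if (open.peek() == Event.START_OBJECT) {
                expectKey = true;
            }
        }
        return open.isEmpty() && topValueSeen;
    }
}
```

For example, the events for {"a":[1,true],"b":null} pass the check, while an object whose value arrives without a preceding KEY_NAME does not.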

State Machine

Without realizing it, I was designing a finite-state machine.

As those states get a little confusing in Javadoc (before it’s javadocced), this table summarizes what every method returns on each event.

Event         isInArray  isInObject  getString  getNumber  getCurrentKey
START_STREAM  F          F           -          -          -
START_ARRAY   T          F           -          -          -
END_ARRAY     ?          ?           -          -          (enclosing?)
START_OBJECT  F          T           -          -          -
END_OBJECT    ?          ?           -          -          (enclosing?)
KEY_NAME      F          T           key        -          key
VALUE_NULL    ?          ?           -          -          last
VALUE_TRUE    ?          ?           -          -          last
VALUE_FALSE   ?          ?           -          -          last
VALUE_NUMBER  ?          ?           number     number     last
VALUE_STRING  ?          ?           string     -          last
END_STREAM    F          F           -          -          -
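
One way to read the getString and getNumber columns is as a pair of guard sets: the accessor succeeds only on events that have an entry in its column. The sketch below is an illustration under assumptions, not the JSONPP implementation. It uses a stand-in Event enum, hypothetical static methods that take the current event and its raw text, and BigDecimal as the Number subclass (the interface leaves the subclass unspecified).

```java
import java.math.BigDecimal;
import java.util.EnumSet;
import java.util.Set;

public class AccessorGuards {
    /** Local stand-in for the JsonEvent enum listed under API. */
    enum Event {
        SYNTAX_ERROR, START_STREAM, START_ARRAY, END_ARRAY, START_OBJECT,
        END_OBJECT, KEY_NAME, VALUE_NULL, VALUE_TRUE, VALUE_FALSE,
        VALUE_NUMBER, VALUE_STRING, END_STREAM
    }

    // Events with an entry in the getString column of the table.
    static final Set<Event> HAS_STRING =
            EnumSet.of(Event.KEY_NAME, Event.VALUE_NUMBER, Event.VALUE_STRING);

    // Events with an entry in the getNumber column of the table.
    static final Set<Event> HAS_NUMBER = EnumSet.of(Event.VALUE_NUMBER);

    /** Returns the text of the current token, or throws if it has none. */
    static String getString(Event current, String text) {
        if (!HAS_STRING.contains(current)) {
            throw new IllegalStateException("no string value on " + current);
        }
        return text;
    }

    /** Returns the current token as a number, or throws if it is not one. */
    static Number getNumber(Event current, String text) {
        if (!HAS_NUMBER.contains(current)) {
            throw new IllegalStateException("no number value on " + current);
        }
        // Assumption: BigDecimal, which accepts JSON's number syntax.
        return new BigDecimal(text);
    }
}
```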

Implementation

In 2019 I cut a somewhat rickety version. Last month I cleaned it up a bit and posted it on GitHub. Anyone can pull from it (I think) but so far no one but me can push.

As of this writing, this is old code. I’m working on a few improvements, beyond the many, many items on my TODO list.

Learning By Coding

Two less public interfaces grew to prominence as coding went on:

The Name Game

I went through several rounds of renaming, particularly of CodePointSource. (In the Jan 29 versions I renamed it Source, but it’s back to the awkward-sounding CodePointSource.) I also separated CodePointSource and the inevitable CodePointSink into their own package, now its own library.

The Pre-Fetching Problem

While testing my old code, I found that it often called getCodePoint() before calling next() to fetch a code point. Apparently I’d kludged the WriterSource to fetch the first code point (or at least char) upon creation, and DefaultJsonLexer (as I eventually called it) depended on that behavior.

Untangling that has taken a lot of my time, and I still need to fix one test that broke when I enforced calling those methods in the right order. Unlike java.util.Iterator and its ilk, CodePointSource separates advancing to the next character from fetching that character because, in part, I didn’t want to “push back” a character the lexer had read that formed part of the next token. Instead of re-pushing and popping characters, code can inspect a CodePointSource’s current state without changing it.
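
A small sketch may make the “no push back” idea concrete. It is not the CodePoint library: StringSource is a hypothetical String-backed class with the same hasNext() / next() / getCodePoint() shape, and its EOF marker of -1 after the last code point is an assumption made for this example. The digit reader stops on the first non-digit and simply leaves it as the current code point for the next reader.

```java
public class PeekingDemo {
    /** Hypothetical String-backed source shaped like CodePointSource. */
    static final class StringSource {
        static final int EOF = -1;    // assumption: marker once input runs out
        private final String text;
        private int nextIndex = 0;    // char index that next() will load
        private int current = EOF;
        private boolean started = false;

        StringSource(String text) {
            this.text = text;
        }

        boolean hasNext() {
            return nextIndex < text.length();
        }

        /** Advances; does not return anything. */
        void next() {
            started = true;
            if (hasNext()) {
                current = text.codePointAt(nextIndex);
                nextIndex += Character.charCount(current);
            } else {
                current = EOF;
            }
        }

        /** Inspects the current code point without changing state. */
        int getCodePoint() {
            if (!started) {
                throw new IllegalStateException("call next() first");
            }
            return current;
        }
    }

    /**
     * Reads a run of digits starting at the current code point. On return
     * the source rests on the first non-digit, so nothing is pushed back.
     */
    static String readDigits(StringSource src) {
        StringBuilder sb = new StringBuilder();
        int cp;
        while ((cp = src.getCodePoint()) != StringSource.EOF
                && Character.isDigit(cp)) {
            sb.appendCodePoint(cp);
            src.next();
        }
        return sb.toString();
    }
}
```

Given the input 123,4 and one initial call to next(), readDigits returns 123 and the comma is still available through getCodePoint() for whichever method handles separators.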

CodePoint

Updated 2023-03-31: Created new project page for CodePoint.

As I mentioned earlier, the simple API to abstract out Unicode code points took on a life of its own. Since I planned other parser projects, I split it off into its own library called “CodePoint”. It even has its own project page.

JSON Push Producer

Flushed with success,[3] I decided to design a JSON emitter that took the parser API and reversed it to create the JSON Push Producer. (JSONPP? Get it?)

However, after writing test code like this:

_producer.setEvent(JsonEvent.START_ARRAY);
_producer.push();
_producer.setEvent(JsonEvent.END_ARRAY);
_producer.push();

… just to produce an empty array [], I now think I’ll steal from JSONP, consciously this time: specifically the Json*Builders from javax.json, with added writeTo(Writer) and writeTo(OutputStream, Charset) methods. I may pick some other alliterative appellation like “Basic Builder”.

Open Issues

The JSONPP TODO.md and CodePoint TODO.md list all the major and minor issues of which I’m aware. To highlight a few not already mentioned:

MODIFIED 2023-04-13:

API

Event

package com.frank_mitchell.jsonpp;

/**
 * Enumeration of all possible JsonPullParser events.
 */
public enum JsonEvent {
    /**
     * Invalid JSON syntax.
     */
    SYNTAX_ERROR,

    /**
     * Before first JSON element
     */
    START_STREAM,

    /**
     * Start of JSON array ('[')
     */
    START_ARRAY,

    /**
     * End of JSON array (']')
     */
    END_ARRAY,

    /**
     * Start of JSON object ('{')
     */
    START_OBJECT,

    /**
     * End of JSON object ('}')
     */
    END_OBJECT,

    /**
     * Key of JSON object member ('"'...'"' ':')
     */
    KEY_NAME,

    /**
     * JSON null ("null")
     */
    VALUE_NULL,

    /**
     * JSON boolean true ("true")
     */
    VALUE_TRUE,

    /**
     * JSON boolean false ("false")
     */
    VALUE_FALSE,

    /**
     * JSON number
     */
    VALUE_NUMBER,

    /**
     * JSON string ("...")
     */
    VALUE_STRING,

    /**
     * After last JSON element
     */
    END_STREAM
}

Pull Parser

package com.frank_mitchell.jsonpp;

import java.io.Closeable;
import java.io.IOException;

/**
 * This interface traverses a JSON Value as a stream of events.
 *
 * Each call to {@link #next()} moves to the next event in the stream, and the
 * various "get" methods identify the type of event, the value of a String or
 * Number, and/or the name of a key String.
 *
 * Implementations of this interface aren't guaranteed to be thread safe. In
 * most cases one thread will parse an input stream and then discard this
 * parser. In some cases one thread <strong>might</strong> hand a parser off to
 * another thread, then continue parsing once that thread has finished. (In the
 * latter case a co-routine or cooperative single-threaded framework might be
 * more efficient.)
 *
 * @author Frank Mitchell
 *
 */
public interface JsonPullParser extends Closeable {

    /**
     * Get the event parsed by the most recent call to {@link #next()}.
     *
     * @return most recently parsed event.
     */
    public JsonEvent getEvent();

    /**
     * Indicates if the enclosing value is a JSON Array.
     *
     * If this object is currently processing the contents of a JSON Array, this
     * method will return {@code true}.
     *
     * @return {@code true} if the enclosing value is a JSON Array.
     *
     * @see #isInObject()
     */
    public boolean isInArray();

    /**
     * Indicates if the enclosing value is a JSON Object.
     *
     * If this parser is currently processing the contents of a JSON Object,
     * this method will return {@code true}. If neither this method nor
     * {@link #isInArray()} are true, this parser is either at the start or end
     * of the document, the document contains only an atomic value, or the
     * parser encountered an error.
     *
     * @return {@code true} if the enclosing value is a JSON Object.
     */
    public boolean isInObject();

    /**
     * Whether this implementation supports {@link #getCurrentKey}.
     * While most should, some implementers may choose memory footprint
     * and speed over convenience.
     * Override this method for implementations that don't.
     *
     * @return whether {@link #getCurrentKey()} is supported
     */
    default public boolean isCurrentKeySupported() {
        return true;
    }

    /**
     * Gets the key associated with the current value.
     *
     * On {@link JsonEvent#KEY_NAME}, the result is the JSON Object key
     * with outer quotes removed and backslash escapes resolved.
     *
     * On {@link JsonEvent#END_OBJECT},
     * {@link JsonEvent#END_ARRAY},
     * {@link JsonEvent#VALUE_STRING},
     * {@link JsonEvent#VALUE_NUMBER},
     * {@link JsonEvent#VALUE_TRUE},
     * {@link JsonEvent#VALUE_FALSE}, or {@link JsonEvent#VALUE_NULL}, the
     * result is the JSON Object key this value should be assigned to, if the
     * enclosing construct is a JSON Object.
     *
     * On {@link JsonEvent#START_STREAM},
     * {@link JsonEvent#START_ARRAY},
     * {@link JsonEvent#START_OBJECT},
     * {@link JsonEvent#END_STREAM}, or {@link JsonEvent#SYNTAX_ERROR}, or if
     * there is no immediately enclosing JSON object, this method returns
     * {@code null}.
     *
     * @return the current key, or {@code null} if there is none
     * @throws UnsupportedOperationException if method not supported.
     */
    public String getCurrentKey();

    /**
     * Gets the string value associated with the current event.
     *
     * On {@link JsonEvent#KEY_NAME}, the result is the JSON Object key with all
     * escape sequences converted to their character values.
     *
     * On {@link JsonEvent#VALUE_STRING}, the result is the JSON String with all
     * escape sequences converted to their character values.
     *
     * On a {@link JsonEvent#VALUE_NUMBER}, the result is the number as
     * originally read.
     *
     * Otherwise the method throws an exception.
     *
     * @return the string form of the current key, String, or Number
     *
     * @throws IllegalStateException if the current event has no string value.
     */
    public String getString();

    /**
     * Gets the {@link Number} value associated with the current event.
     *
     * If {@link #getEvent()} is {@link JsonEvent#VALUE_NUMBER}, this method
     * returns an unspecified subclass of Number. Otherwise this method throws
     * an exception.
     *
     * @return the value of the current JSON Number
     *
     * @throws IllegalStateException if the current event is not a number.
     */
    public Number getNumber() throws IllegalStateException;

    /**
     * Gets the {@code double} value associated with the current event.
     *
     * If {@link #getEvent()} is {@link JsonEvent#VALUE_NUMBER}, this method
     * returns the current number converted to {@code double}. Otherwise this
     * method throws an exception.
     *
     * @return the value of the current JSON Number as a {@code double}, or
     *         {@code NaN} if {@link #getNumber()} returns {@code null}
     *
     * @throws IllegalStateException if the current event is not a number.
     */
    default double getDouble() throws IllegalStateException {
        Number n = getNumber();
        if (n == null) {
            return Double.NaN;
        } else {
            return n.doubleValue();
        }
    }

    /**
     * Gets the {@code int} value associated with the current event.
     *
     * If {@link #getEvent()} is {@link JsonEvent#VALUE_NUMBER}, this method
     * returns the current number converted to {@code int}. Otherwise this
     * method throws an exception.
     *
     * @return the value of the current JSON Number as an {@code int}
     *
     * @throws IllegalStateException if the current event is not a number.
     */
    default int getInt() throws IllegalStateException {
        Number n = getNumber();
        if (n == null) {
            throw new IllegalStateException("!" + JsonEvent.VALUE_NUMBER);
        }
        return n.intValue();
    }

    /**
     * Gets the {@code long} value associated with the current event.
     *
     * If {@link #getEvent()} is {@link JsonEvent#VALUE_NUMBER}, this method
     * returns the current number converted to {@code long}. Otherwise this
     * method throws an exception.
     *
     * @return the value of the current JSON Number as a {@code long}
     *
     * @throws IllegalStateException if the current event is not a number.
     */
    default public long getLong() throws IllegalStateException {
        Number n = getNumber();
        if (n == null) {
            throw new IllegalStateException("!" + JsonEvent.VALUE_NUMBER);
        }
        return n.longValue();
    }

    /**
     * Advances to the next significant JSON element in the underlying stream.
     *
     * @throws IOException if the character source could not be read.
     */
    public void next() throws IOException;

    /**
     * Equivalent to calling {@link #next()} followed by {@link #getEvent()}.
     *
     * @return most recently parsed event.
     *
     * @throws IOException if the character source could not be read
     */
    default public JsonEvent nextEvent() throws IOException {
        next();
        return getEvent();
    }

    /**
     * Close the underlying IO or NIO object.
     *
     * @throws IOException from the underlying object.
     */
    @Override
    void close() throws IOException;
}
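
One thing to keep in mind with the default getInt() and getLong() above is that Number.intValue() and Number.longValue() narrow silently. The sketch below is not part of JSONPP: lenientInt mirrors what the default method does, and strictInt is a hypothetical alternative a caller could apply to the result of getNumber(). It assumes the Number prints as a finite decimal literal, which holds for the standard Number subclasses.

```java
import java.math.BigDecimal;

public class NarrowingDemo {
    /** Same conversion as the default getInt(): plain Number.intValue(). */
    static int lenientInt(Number n) {
        return n.intValue();
    }

    /**
     * Hypothetical stricter conversion: throws ArithmeticException when the
     * value has a fractional part or does not fit in an int.
     */
    static int strictInt(Number n) {
        // Going through the decimal string keeps Long and BigDecimal exact.
        return new BigDecimal(n.toString()).intValueExact();
    }

    public static void main(String[] args) {
        System.out.println(lenientInt(3.9));          // 3, fraction dropped
        System.out.println(lenientInt(4294967297L));  // 1, high bits dropped
        System.out.println(strictInt(42L));           // 42
    }
}
```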

Pull Parser Factory

package com.frank_mitchell.jsonpp;

import com.frank_mitchell.codepoint.CodePointSource;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.Map;

/**
 * Creates JsonPullParser instances without clients knowing the specific
 * class(es) used.
 *
 * {@link #setConfiguration(Map)} provides a hook to configure a factory without
 * knowing or caring what specific instance performs the work.
 */
public interface JsonPullParserFactory {

    /**
     * Create a parser to read {@code char}s.
     *
     * @param reader a stream of UTF-16 characters to parse
     * @return new parser
     * @throws IOException if source throws an IOException
     */
    JsonPullParser createParser(Reader reader) throws IOException;

    /**
     * Create a parser to process an ASCII or UTF-8 stream.
     *
     * @param input a stream of ASCII or UTF-8 bytes
     * @return new parser
     * @throws IOException if source throws an IOException
     */
    JsonPullParser createUtf8Parser(InputStream input) throws IOException;

    /**
     * Create a parser to process an encoded byte stream.
     *
     * @param input a stream of encoded bytes to parse
     * @param enc   the stream's character encoding
     * @return new parser
     * @throws IOException if source throws an IOException
     */
    JsonPullParser createParser(InputStream input, Charset enc) throws IOException;

    /**
     * Create a parser to process a stream of Unicode code points.
     *
     * @param source provider of code points
     * @return new parser
     * @throws IOException if source throws an IOException
     */
    JsonPullParser createParser(CodePointSource source) throws IOException;
}

Code Point Source

package com.frank_mitchell.codepoint;

import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

/**
 * An iterator over an external sequence of Unicode code points.
 * Using {@code int} instead of {@code char} is a bit of 
 * future-proofing for when streams commonly contain characters
 * outside of the Basic Multilingual Plane (0x0000 - 0xFFFF).
 * Implementers can transparently decode UTF-8 or UTF-16 multi-byte
 * characters into a single code point.  (At least until Unicode expands
 * past 32 bits.)
 * 
 * Unlike standard Java {@link Iterator}s, advancing the iterator and
 * reading the next item in the sequence can be two separate actions.
 * That way one can pass the source to other methods and they can read
 * the last code point read without altering state.
 * 
 * @author Frank Mitchell
 */
public interface CodePointSource extends Closeable {

    /**
     * Read the current code point after the last call to {@link #next()}.
     * 
     * @return current code point.
     */
    int getCodePoint();

    /**
     * Whether this source still has code points remaining.
     * This method may read ahead to the next character, which may
     * cause an exception.
     * 
     * @return whether this object has a next code point.
     * 
     * @throws java.io.IOException if read-ahead throws an exception
     */
    boolean hasNext() throws IOException;

    /**
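     * Advance to the next code point, which {@link #getCodePoint()} will
     * then return.
     *
     * @throws IOException if the underlying source could not be read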
     */
    void next() throws IOException;
   
    /**
     * Close the underlying IO or NIO object.
     * 
     * @throws IOException from the underlying object.
     */
    @Override
    void close() throws IOException;
}

MIT License

Copyright 2019, 2023 Frank Mitchell

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


  1. Unbeknownst to me when I started this project, JSONP’s javax.json.stream implements a pull parser too.

  2. An optional extension is to allow multiple values on, for example, half a network socket. END_STREAM would signal the closing of that half of the socket.

  3. Metaphorically. I doubt my skin ever loses its pallor.