Jackson CSV Serialization and Deserialization from the Ground Up

While there are many examples of Jackson serialization to JSON, there are comparatively few resources on Jackson serialization to CSV. The following example works with a TSV-formatted dataset from the ground up: creating the model object, parsing TSV into Java objects with Jackson, writing Java objects back to TSV with Jackson, and finishing with a full round-trip serialization test.

This example also models the DTO as a Java record (as opposed to a traditional Java class), so it shows how to use records with Jackson as well.

Understanding the Data

This example uses the “alternate names” dataset from geonames.org. A geonames.org alternate name record has the following fields, per the Geonames readme file:

The table 'alternate names' :
-----------------------------
alternateNameId : the id of this alternate name, int
geonameid : geonameId referring to id in table 'geoname', int
isolanguage : iso 639 language code 2- or 3-characters, optionally followed by a hyphen and a countrycode for country specific variants (ex:zh-CN) or by a variant name (ex: zh-Hant); 4-characters 'post' for postal codes and 'iata','icao' and faac for airport codes, fr_1793 for French Revolution names, abbr for abbreviation, link to a website (mostly to wikipedia), wkdt for the wikidataid, varchar(7)
alternate name : alternate name or name variant, varchar(400)
isPreferredName : '1', if this alternate name is an official/preferred name
isShortName : '1', if this is a short name like 'California' for 'State of California'
isColloquial : '1', if this alternate name is a colloquial or slang term. Example: 'Big Apple' for 'New York'.
isHistoric : '1', if this alternate name is historic and was used in the past. Example 'Bombay' for 'Mumbai'.
from : from period when the name was used
to : to period when the name was used

The full dataset is freely available at download.geonames.org, under the filename alternateNames.zip. The first few lines of the file look like this:

10552347	10071541	sv	Vitharuna				
15654026 874380 wkdt Q31583993
8471225 8456578 hy Սալք Լեռ

Note that the file is in TSV format, and many fields are blank.
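Blank fields matter when handling rows by hand. As a minimal sketch in plain Java (no Jackson involved), the helper below splits the first sample row on tabs while keeping its empty fields. The class and method names are hypothetical, and the row is written with explicit tab escapes on the assumption that it has the ten columns listed in the readme.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Jackson or the Geonames tooling.
final class TsvLineSketch {
  private TsvLineSketch() {}

  /** Splits one TSV line into fields, keeping blank fields, including trailing ones. */
  static List<String> splitFields(String line) {
    // The -1 limit stops String.split from discarding trailing empty strings.
    return Arrays.asList(line.split("\t", -1));
  }

  public static void main(String[] args) {
    // First sample row, assuming ten columns: four populated, six blank
    String line = "10552347\t10071541\tsv\tVitharuna\t\t\t\t\t\t";
    List<String> fields = splitFields(line);
    System.out.println(fields.size());            // 10
    System.out.println(fields.get(3));            // Vitharuna
    System.out.println(fields.get(4).isEmpty());  // true
  }
}
```

Without the -1 limit, String.split would drop the six trailing blanks and return only four fields, which is one reason to leave the parsing to a library that works from a schema.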

Setting Up the Project

To prepare for performing CSV de/serialization, add the following dependencies to your Maven POM:

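<!-- This dependency provides the core annotations for the model object.  -->
<!-- @JsonSerialize and @JsonDeserialize live in jackson-databind, which  -->
<!-- jackson-dataformat-csv (below) brings in transitively.               -->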
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-annotations</artifactId>
  <version>${jackson.version}</version>
</dependency>

<!-- These dependencies are required to perform the actual -->
<!-- serialization and deserialization of model objects.   -->
<dependency>
  <groupId>com.fasterxml.jackson.dataformat</groupId>
  <artifactId>jackson-dataformat-csv</artifactId>
  <version>${jackson.version}</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.datatype</groupId>
  <artifactId>jackson-datatype-jdk8</artifactId>
  <version>${jackson.version}</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.datatype</groupId>
  <artifactId>jackson-datatype-jsr310</artifactId>
  <version>${jackson.version}</version>
</dependency>

At the time of this writing, the latest version of Jackson is 2.18.2. The preferred method for version management is to use the Jackson BOM.

Creating the Data Model

This example will use a data transfer object (DTO) for reading and writing records. Since this is “plain ol’ data”, we’ll use a Java record to represent an alternate name record. Here’s the complete DTO implementation:

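import com.fasterxml.jackson.annotation.JsonFormat;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonPropertyOrder;
import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
import com.fasterxml.jackson.databind.annotation.JsonSerialize;
import java.time.LocalDate;
import java.util.Optional;

@JsonPropertyOrder({"alternate_name_id", "geonameid", "isolanguage", "alternate_name",
    "is_preferred_name", "is_short_name", "is_colloquial", "is_historic", "from", "to"})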
public record AlternateName(@JsonProperty("alternate_name_id") String alternateNameId,
    @JsonProperty("geonameid") String geonameid, @JsonProperty("isolanguage") String isolanguage,
    @JsonProperty("alternate_name") String alternateName,
    @JsonProperty("is_preferred_name") @JsonDeserialize(
        using = BooleanGeonamesDeserializer.class) @JsonSerialize(
            using = BooleanGeonamesSerializer.class) boolean preferredName,
    @JsonProperty("is_short_name") @JsonDeserialize(
        using = BooleanGeonamesDeserializer.class) @JsonSerialize(
            using = BooleanGeonamesSerializer.class) boolean shortName,
    @JsonProperty("is_colloquial") @JsonDeserialize(
        using = BooleanGeonamesDeserializer.class) @JsonSerialize(
            using = BooleanGeonamesSerializer.class) boolean colloquial,
    @JsonProperty("is_historic") @JsonDeserialize(
        using = BooleanGeonamesDeserializer.class) @JsonSerialize(
            using = BooleanGeonamesSerializer.class) boolean historic,
    @JsonProperty("from") @JsonFormat(shape = JsonFormat.Shape.STRING,
        pattern = "yyyy-MM-dd") Optional<LocalDate> from,
    @JsonProperty("to") @JsonFormat(shape = JsonFormat.Shape.STRING,
        pattern = "yyyy-MM-dd") Optional<LocalDate> to) {
  public AlternateName {
    // Validate constraints
    if (alternateNameId == null)
      throw new NullPointerException();
    if (geonameid == null)
      throw new NullPointerException();
    if (isolanguage == null)
      throw new NullPointerException();
    if (alternateName == null)
      throw new NullPointerException();
    if (from == null)
      throw new NullPointerException();
    if (to == null)
      throw new NullPointerException();
    if (from.isPresent() != to.isPresent()) {
      throw new IllegalArgumentException("from and to must be both present or both empty");
    }
    if (from.isPresent() && from.get().isAfter(to.get())) {
      throw new IllegalArgumentException("from must not be after to");
    }
  }
}

Unpacking the Data Model Design

There are a lot of choices and details in the design of this model class. Let’s unpack them and discuss the reasoning.

Design Choices

Optional fields

Some fields in Geonames alternate names are optional, i.e., they may or may not have values.

The DTO uses Optional-valued fields to indicate optional fields. From a design perspective, this makes it immediately obvious which fields are which: non-Optional fields always hold meaningful non-null values, whereas Optional-valued fields are themselves never null but may be empty.

For an internal-facing DTO, returning null for unpopulated optional fields is a reasonable choice, as long as this is carefully documented. However, for external-facing DTOs, Optional fields are preferred, absent performance issues.
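To make the trade-off concrete, here is a minimal sketch using a simplified, hypothetical record (UsagePeriod is not part of the DTO above) that keeps only the from/to pair. It assumes Java 16 or later for record support. The call site can consume the optional values without any null checks, and the compact constructor enforces the same invariants as the full DTO.

```java
import java.time.LocalDate;
import java.util.Objects;
import java.util.Optional;

// Hypothetical, simplified stand-in for the from/to portion of the DTO.
record UsagePeriod(Optional<LocalDate> from, Optional<LocalDate> to) {
  UsagePeriod {
    // The Optional references themselves must never be null
    Objects.requireNonNull(from, "from");
    Objects.requireNonNull(to, "to");
    if (from.isPresent() != to.isPresent()) {
      throw new IllegalArgumentException("from and to must be both present or both empty");
    }
    if (from.isPresent() && from.get().isAfter(to.get())) {
      throw new IllegalArgumentException("from must not be after to");
    }
  }

  /** Describes the period; the constructor guarantees to is present whenever from is. */
  String describe() {
    return from.map(start -> start + " to " + to.get()).orElse("unbounded");
  }
}
```

For example, new UsagePeriod(Optional.empty(), Optional.empty()).describe() returns "unbounded", while passing only one of the two dates throws an IllegalArgumentException.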

Jackson Annotations

The implementation makes heavy use of Jackson annotations.

The @JsonPropertyOrder annotation

In order to de/serialize CSV data, Jackson has to know the order in which the fields appear within each row. This can be given on the class in the form of an @JsonPropertyOrder annotation, or through a separate CsvSchema object. This implementation uses the @JsonPropertyOrder annotation to keep the information about headers and order centralized. This is not a mixing of application layers because the class is a DTO explicitly for the purposes of serialization.

No @JsonCreator annotation

Since we’re using a Java record with a sole canonical constructor, we don’t need to use @JsonCreator on the constructor, like we would in a traditional Java class. Neat!

The @JsonProperty annotations on constructor parameters

The @JsonProperty annotations on the constructor parameters map the logical serialization fields to physical parameters, just like they would in a traditional @JsonCreator constructor. Note that the logical field names given in these annotations must match the logical field names in the @JsonPropertyOrder annotation.

No other @JsonProperty annotations

Since we’re using a Java record, we only need to annotate the constructor parameters. In a traditional Java class, we might also have to annotate fields and/or accessor methods, depending on the logical serialization field names we chose.

Jackson would also pick up annotations on any accessor methods, if we created them, even though we’re using a record class.

The @JsonSerialize and @JsonDeserialize annotations

The @JsonSerialize and @JsonDeserialize annotations on some parameters tell Jackson to use custom de/serializers (as opposed to default de/serializers) to de/serialize the logical fields values into the parameter values.

This is required to match the Geonames TSV format. For example, in the TSV format, boolean values are serialized as "1" for true, and the empty string for false. The standard serializers would use true and false, respectively.

The following custom de/serializer implementations handle the customization for the boolean type:

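// BooleanGeonamesSerializer.java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonSerializer;
import com.fasterxml.jackson.databind.SerializerProvider;
import java.io.IOException;

public class BooleanGeonamesSerializer extends JsonSerializer<Boolean> {
  @Override
  public void serialize(Boolean value, JsonGenerator gen, SerializerProvider serializers)
      throws IOException {
    // Geonames writes "1" for true and leaves the field blank for false
    if (value == null || !value) {
      gen.writeString("");
    } else {
      gen.writeString("1");
    }
  }
}

// BooleanGeonamesDeserializer.java
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.DeserializationContext;
import com.fasterxml.jackson.databind.JsonDeserializer;
import java.io.IOException;

public class BooleanGeonamesDeserializer extends JsonDeserializer<Boolean> {
  @Override
  public Boolean deserialize(JsonParser p, DeserializationContext ctxt) throws IOException {
    // A blank field means false; only the exact text "1" means true
    String text = p.getText();
    if (text == null || text.isBlank())
      return false;
    return text.equals("1");
  }
}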

The other custom de/serializers are substantially similar.

The @JsonFormat annotations

The @JsonFormat annotations on the LocalDate fields tell Jackson to de/serialize them as strings in ISO-8601 format, as opposed to timestamps, which is the Jackson default.
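To see what that pattern means in practice, the following sketch applies the same "yyyy-MM-dd" pattern string with java.time directly. It illustrates the pattern itself, not Jackson's internals, and the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper showing the pattern used in the @JsonFormat annotations.
final class DatePatternSketch {
  private DatePatternSketch() {}

  // Same pattern string as in the @JsonFormat annotations
  static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

  /** Renders a date as zero-padded year-month-day text. */
  static String format(LocalDate date) {
    return FORMAT.format(date);
  }

  /** Parses year-month-day text back into a date. */
  static LocalDate parse(String text) {
    return LocalDate.parse(text, FORMAT);
  }

  public static void main(String[] args) {
    System.out.println(format(LocalDate.of(2020, 1, 1)));  // 2020-01-01
    System.out.println(parse("2023-12-31").getYear());     // 2023
  }
}
```

These are the same date strings that appear in the from and to columns of the round-trip test later in this article.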

Serialization of the Data Model

Adding Jackson annotations to the model object allows for simple serialization to and from TSV using Jackson.

Deserializing Data Using the Data Model

The following code deserializes a list of model objects from a byte stream:


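import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;
import com.fasterxml.jackson.datatype.jdk8.Jdk8Module;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Jdk8Module handles Optional; JavaTimeModule handles LocalDate
public static final CsvMapper MAPPER = (CsvMapper) new CsvMapper()
  .registerModule(new Jdk8Module())
  .registerModule(new JavaTimeModule());

// Column order comes from @JsonPropertyOrder; no header row, no quoting, tab-separated
public static final CsvSchema ALTERNATE_NAME_TSV_SCHEMA = MAPPER
  .schemaFor(AlternateName.class)
  .withoutHeader()
  .withoutQuoteChar()
  .withColumnSeparator('\t');

public static List<AlternateName> deserializeAlternateNames(
    InputStream in) throws IOException {
  List<AlternateName> result = new ArrayList<>();

  try (final MappingIterator<AlternateName> iterator =
      MAPPER.readerFor(AlternateName.class)
        .with(ALTERNATE_NAME_TSV_SCHEMA)
        .readValues(new InputStreamReader(
          in,
          StandardCharsets.UTF_8))) {
    while (iterator.hasNext()) {
      result.add(iterator.next());
    }
  }

  return Collections.unmodifiableList(result);
}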
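Because the dataset ships as alternateNames.zip, one way to obtain the InputStream for the method above is to read the entry straight out of the archive with the JDK's zip support. The following is a minimal sketch under that assumption. The class name, method name, and the entry name used in the usage note are hypothetical and should be checked against the actual archive.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of Jackson or the Geonames tooling.
final class ZipEntrySketch {
  private ZipEntrySketch() {}

  /**
   * Returns a stream positioned at the start of the named entry. Reads end at
   * the end of that entry, and closing the result closes the underlying stream.
   */
  static InputStream openEntry(InputStream zipped, String entryName) throws IOException {
    ZipInputStream zip = new ZipInputStream(zipped);
    for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
      if (entry.getName().equals(entryName)) {
        return zip;
      }
    }
    // No matching entry: release the stream before reporting the failure
    zip.close();
    throw new FileNotFoundException("no zip entry named " + entryName);
  }
}
```

A caller could then wrap something like ZipEntrySketch.openEntry(Files.newInputStream(Path.of("alternateNames.zip")), "alternateNames.txt") in a try-with-resources block and pass the resulting stream to deserializeAlternateNames, assuming the archive's data file uses that entry name.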

Serializing Data Using the Data Model

The following code serializes a collection of model objects to a byte stream:

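import com.fasterxml.jackson.databind.SequenceWriter;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;
import com.fasterxml.jackson.datatype.jdk8.Jdk8Module;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Collection;

public static final CsvMapper MAPPER = (CsvMapper) new CsvMapper()
  .registerModule(new Jdk8Module())
  .registerModule(new JavaTimeModule());

public static final CsvSchema ALTERNATE_NAME_TSV_SCHEMA = MAPPER
  .schemaFor(AlternateName.class)
  .withoutHeader()
  .withoutQuoteChar()
  .withColumnSeparator('\t');

public static void serializeAlternateNames(
    OutputStream out,
    Collection<AlternateName> alternateNames) throws IOException {
  // Closing the SequenceWriter flushes any buffered rows to the stream
  try (SequenceWriter w =
      MAPPER
        .writerFor(AlternateName.class)
        .with(ALTERNATE_NAME_TSV_SCHEMA)
        .writeValues(new OutputStreamWriter(
          out,
          StandardCharsets.UTF_8))) {
    for (AlternateName alternateName : alternateNames) {
      w.write(alternateName);
    }
  }
}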

Serialization Testing

The following (simple) test verifies that the serialization code works. Of course, reading from the official alternate names dataset would be best, but this is enough to demonstrate the concept.

public static final CsvMapper MAPPER=(CsvMapper) new CsvMapper()
  .registerModule(new Jdk8Module())
  .registerModule(new JavaTimeModule());

public static final CsvSchema ALTERNATE_NAME_TSV_SCHEMA=MAPPER
  .schemaFor(AlternateName.class)
  .withoutHeader()
  .withoutQuoteChar()
  .withColumnSeparator('\t');

@Test
public void testCsvRoundTrip() throws Exception {
    // Define the original dataset
    List<AlternateName> originalData = List.of(
        new AlternateName("10552347", "10071541", "sv", "Vitharuna", false, false, false, false, Optional.empty(), Optional.empty()),
        new AlternateName("13522193", "3361432", "en", "Little Smith's Winkel Bay", false, false, false, true, Optional.of(LocalDate.of(2020, 1, 1)), Optional.of(LocalDate.of(2023, 12, 31))));

    // Serialize the dataset
    StringWriter writer = new StringWriter();
    MAPPER
        .writerFor(AlternateName.class)
        .with(ALTERNATE_NAME_TSV_SCHEMA)
        .writeValues(writer)
        .writeAll(originalData);
    String actualSerializedForm = writer.toString();

    // Ensure the actual serialized form looks right
    String expectedSerializedForm = """
        10552347\t10071541\tsv\tVitharuna\t\t\t\t\t\t
        13522193\t3361432\ten\tLittle Smith's Winkel Bay\t\t\t\t1\t2020-01-01\t2023-12-31
        """;
    assertEquals(expectedSerializedForm, actualSerializedForm);

    // Deserialize the serialized form
    StringReader reader = new StringReader(actualSerializedForm);
    List<AlternateName> deserializedData = MAPPER
        .readerFor(AlternateName.class)
        .with(ALTERNATE_NAME_TSV_SCHEMA)
        .<AlternateName>readValues(reader)
        .readAll();

    // Make sure the deserialized data matches the original data
    assertEquals(originalData, deserializedData);
}

Takeaways

  • Java records: Modern Java records simplify the creation of data classes for structured data like CSV records, improving code readability and maintainability.
  • Jackson annotations: Annotations such as @JsonProperty, @JsonFormat, and @JsonSerialize enable precise control over how data is serialized and deserialized for both JSON and CSV.
  • Custom serializers/deserializers: When working with non-standard formats, custom serializers and deserializers allow you to tailor the handling of specific fields to meet your needs.
  • Validation at the model level: Adding validation logic within your data model ensures data integrity and reduces potential runtime issues.
  • Comprehensive testing: Round-trip serialization and deserialization tests help verify that your implementation handles real-world scenarios correctly.
  • Efficient CSV handling: Jackson’s CSV module, paired with Java’s latest features, makes working with even complex CSV datasets cleaner and more efficient.

Conclusion


Jackson, combined with Java records, simplifies the process of handling structured data like Geonames’ alternate name records. By leveraging annotations, custom serializers, and a robust validation approach, you can create clean, maintainable solutions for complex CSV datasets. Whether you’re working with GeoNames or a similar format, these techniques provide a reliable foundation for efficient data processing.
