Street map address collections in software represent an interesting case for customized data structures. The following article explains a memory optimization strategy for when large collections of records consist of highly repetitive string components.

Learning Points

  • Why memory is sometimes wasted on redundant string data.
  • Suggested solution when the input is known to contain highly repetitive string data.
  • Why to extend the base collections in Java when creating custom data structures.

The Task

We have the case of programming a street map application. It will parse a large file from OpenStreetMap, with several million street addresses to store.

A (fictional) address from the dataset could look roughly similar to this:

{
    "lat": 55.0300200,
    "lon": 12.5030440,
    "street": "Harbor Street",
    "housenumber": "123A",
    "city": "Copenhagen South",
    "country": "DK",
    "postcode": "1223"
}

Defining the Data Object

It is rather easy to make out how to create this within the object-oriented paradigm. Define a public record Address in code:

public record Address(
    double lat, 
    double lon, 
    String street, 
    String housenumber, 
    String city, 
    String country, 
    String postcode
) 
{}

Records include get methods for all component values and override the equals(...) and hashCode() methods to be value-based by default.

Using a Collection

We want to add the addresses to some type of Collection for later use. In this example, we want the data type to be a list of addresses. In code:

List<Address> addressList;

This type is defined by the List interface in Java. The implementation, based on some data structure, must support all the methods defined in this interface. This is the “Dependency Inversion Principle” of SOLID.

ArrayList is one such implementation. It implements List as an array of reference pointers to the data, which is maintained, modified and replaced by instance method calls as needed.

List<Address> addressList = new ArrayList<>();

// Add records to list:
var address = new Address(/* Values go here */);
addressList.add(address);

// Get the record at index n:
Address a = addressList.get(n);

Issue - Repetition in String Data

Take the following array of addresses, in which only the house number changes as we go down the street:

[
    {
        "street": "Harbor Street",
        "housenumber": "123A",
        "city": "Copenhagen South",
        "country": "DK",
        "postcode": "1223"
    },
    {
        "street": "Harbor Street",
        "housenumber": "125",
        "city": "Copenhagen South",
        "country": "DK",
        "postcode": "1223"
    },
    {
        "street": "Harbor Street",
        "housenumber": "127",
        "city": "Copenhagen South",
        "country": "DK",
        "postcode": "1223"
    }
]

Remember that Strings are an immutable value-based type. To represent the data above, only the unique set of strings would need to be stored in memory. These are here in order of appearance:

{
    "Harbor Street", "123A", "Copenhagen South", "DK", "1223", "125", "127"
}
// 7 strings, 45 characters.

However, if you were to create and keep instances of Address based on data from most parsers, you may (in the worst case) actually store this many strings in memory:

{
    "Harbor Street", "123A", "Copenhagen South", "DK", "1223",
    "Harbor Street", "125", "Copenhagen South", "DK", "1223",
    "Harbor Street", "127", "Copenhagen South", "DK", "1223",
}
// 15 strings, 115 characters!

Why does this happen?

For performance reasons, parsers of XML, JSON, etc. will often return strings created directly from the input stream data. It would be far slower to have to compare all of the input to some cache of strings already on hand.

However, some datasets such as street addresses consist of a highly regular and limited set of strings. In the case of Denmark, each of its 2.6 million addresses can be represented as a regular expression of a street, a number, and a postcode-city pair. The components come together to be little more than 60,000 distinct strings. 1

Unfortunately, we end up keeping all of the strings referenced by address records, even if multiple strings contain identical data.

Approaches to String Caching

In anticipation of a dataset containing a lot of redundant strings, we may opt in on caching strings in a custom implementation of List<Address>.

In this implementation, we would keep a set or map of unique strings. Any incoming address objects would be discarded. Instead, the value of each object would be saved as a new record or in some other structure. Only the first instance of every string value would be referenced, others dereferenced for the garbage collector.

  • This could be implemented as a wrapper class for a list.
  • You may opt to create a new implementation entirely, in order to apply further optimizations. The simplest way to do this is to extend the AbstractList class.

  1. I found this through experiments on OSM-data during my project. Try confirming this yourself. ↩︎