On Address Data - Issues and Optimizations

Street map address collections in software represent an interesting case for customized data structures. The following article explains a memory optimization strategy for when large collections of records consist of highly repetitive string components.

Learning Points

Why memory is sometimes wasted on redundant string data.
Suggested solution when the input is known to contain highly repetitive string data.
Why to extend the base collections in Java when creating custom data structures.

The Task

We have the case of programming a street map application. It will parse a large file from OpenStreetMap, with several million street addresses to store.

A (fictional) address from the dataset could look roughly similar to this:

{
    "lat": 55.0300200,
    "lon": 12.5030440,
    "street": "Harbor Street",
    "housenumber": "123A",
    "city": "Copenhagen South",
    "country": "DK",
    "postcode": "1223"
}

Defining the Data Object

It is rather easy to make out how to create this within the object-oriented paradigm. Define a public record Address in code:

public record Address(
    double lat, 
    double lon, 
    String street, 
    String housenumber, 
    String city, 
    String country, 
    String postcode
) 
{}

Records include get methods for all component values and override the equals(...) and hashCode() methods to be value-based by default.

Using a Collection

We want to add the addresses to some type of Collection for later use. In this example, we want the data type to be a list of addresses. In code:

List<Address> addressList;

This type is defined by the List interface in Java. The implementation, based on some data structure, must support all the methods defined in this interface. This is the “Dependency Inversion Principle” of SOLID in action.

ArrayList is one such implementation. It implements List as an array of reference pointers to the data, which is maintained, modified and replaced by instance method calls as needed.

List<Address> addressList = new ArrayList<>();

// Add records to list:
var address = new Address(/* Values go here */);
addressList.add(address);

// Get the record at index n:
Address a = addressList.get(n);

Issue - Repetition in String Data

Before you start adding millions of addresses from parsed input data, do please consider this simple optimization issue.

Take the following array of addresses, in which only the house number changes as we go down the street:

[
    {
        "street": "Harbor Street",
        "housenumber": "123A",
        "city": "Copenhagen South",
        "country": "DK",
        "postcode": "1223"
    },
    {
        "street": "Harbor Street",
        "housenumber": "125",
        "city": "Copenhagen South",
        "country": "DK",
        "postcode": "1223"
    },
    {
        "street": "Harbor Street",
        "housenumber": "127",
        "city": "Copenhagen South",
        "country": "DK",
        "postcode": "1223"
    }
]

Remember that Strings are an immutable value-based type. To represent the data above, only the unique set of strings would need to be stored in memory. These are here in order of appearance:

{
    "Harbor Street", "123A", "Copenhagen South", "DK", "1223", "125", "127"
}
// 7 strings, 45 characters.

However, if we were to create and keep instances of Address based on data from most parsers, we may actually store this many strings in memory:

{
    "Harbor Street", "123A", "Copenhagen South", "DK", "1223",
    "Harbor Street", "125", "Copenhagen South", "DK", "1223",
    "Harbor Street", "127", "Copenhagen South", "DK", "1223",
}
// 15 strings, 115 characters!

This is in the worst case, with every address parsed creating an entirely new set of strings.

Why does this happen?

For performance reasons, parsers of XML, JSON, etc. will often return String instances created directly from the input stream data. It would be far slower to have to compare all of the input to some cache of strings already on hand. The trade-off is that we will use some more memory storing redundant strings.

In the case of Denmark, each of its 2.6 million addresses can be represented as a regular expression of a street, a number, and a postcode-city pair. This comes together to be little more than 60,000 distinct strings. ¹

Simply passing on all of the Address objects to a collection will not optimize this. All of the underlying String instances will continue to be in use throughout the runtime.

Approach to String Caching

In anticipation of a dataset containing a lot of duplicated text snippets, we can choose to cache distinct strings using a custom collection implementation, or as part a wrapper class for a collection.

For the cache, use a fast Map<Key, Value> implementation. The Key will always be String for comparing to incoming data. The value will either be the distinct String that we store, or an integer pointing to a String[] or List<String> of distinct strings. Store the values instead of the incoming Address objects, either in arrays or as new identical Address instances in an underlying collection.

For Java I recommend extending the abstract type of collection that is to be implemented. This assists with getting collection behavior specific to Java correct. E.g. to develop a new type of List, extend the AbstractList class.

In any case:

Recognize that Address is a value-based type. The implemented collection should store the value that an Address represents. One Instance can take the place of another so long as the value is the same.
Values can be represented imperatively within some processes or collections for performance reasons. This applies even if the interface will be object-oriented. Using index numbers instead of pointer references, using arrays to store coordinates, etc. Choosing a programming paradigm is a matter of experience, and will vary for different applications.

I found this through experiments on OSM-data during my project. Try confirming this yourself. ↩︎

Learning Points#

The Task#

Defining the Data Object#

Using a Collection#

Issue - Repetition in String Data#

Why does this happen?#

Approach to String Caching#