Street map address collections in software represent an interesting case for customized data structures. The following article explains a memory optimization strategy for when large collections of records consist of highly repetitive string components.
Learning Points
- Why memory is sometimes wasted on redundant string data.
- Suggested solution when the input is known to contain highly repetitive string data.
- Why to extend the base collections in Java when creating custom data structures.
The Task
We have the case of programming a street map application. It will parse a large file from OpenStreetMap, with several million street addresses to store.
A (fictional) address from the dataset could look roughly similar to this:
{
"lat": 55.0300200,
"lon": 12.5030440,
"street": "Harbor Street",
"housenumber": "123A",
"city": "Copenhagen South",
"country": "DK",
"postcode": "1223"
}
Defining the Data Object
It is rather easy to make out how to create this within the object-oriented paradigm. Define a public record Address
in code:
public record Address(
double lat,
double lon,
String street,
String housenumber,
String city,
String country,
String postcode
)
{}
Records include get methods for all component values and override the equals(...)
and hashCode()
methods to be value-based by default.
Using a Collection
We want to add the addresses to some type of Collection for later use. In this example, we want the data type to be a list of addresses. In code:
List<Address> addressList;
This type is defined by the List interface in Java. The implementation, based on some data structure, must support all the methods defined in this interface. This is the “Dependency Inversion Principle” of SOLID in action.
ArrayList
is one such implementation. It implements List
as an array of reference pointers to the data, which is maintained, modified and replaced by instance method calls as needed.
List<Address> addressList = new ArrayList<>();
// Add records to list:
var address = new Address(/* Values go here */);
addressList.add(address);
// Get the record at index n:
Address a = addressList.get(n);
Issue - Repetition in String Data
Before you start adding millions of addresses from parsed input data, do please consider this simple optimization issue.
Take the following array of addresses, in which only the house number changes as we go down the street:
[
{
"street": "Harbor Street",
"housenumber": "123A",
"city": "Copenhagen South",
"country": "DK",
"postcode": "1223"
},
{
"street": "Harbor Street",
"housenumber": "125",
"city": "Copenhagen South",
"country": "DK",
"postcode": "1223"
},
{
"street": "Harbor Street",
"housenumber": "127",
"city": "Copenhagen South",
"country": "DK",
"postcode": "1223"
}
]
Remember that Strings
are an immutable value-based type. To represent the data above, only the unique set of strings would need to be stored in memory. These are here in order of appearance:
{
"Harbor Street", "123A", "Copenhagen South", "DK", "1223", "125", "127"
}
// 7 strings, 45 characters.
However, if we were to create and keep instances of Address
based on data from most parsers, we may actually store this many strings in memory:
{
"Harbor Street", "123A", "Copenhagen South", "DK", "1223",
"Harbor Street", "125", "Copenhagen South", "DK", "1223",
"Harbor Street", "127", "Copenhagen South", "DK", "1223",
}
// 15 strings, 115 characters!
This is in the worst case, with every address parsed creating an entirely new set of strings.
Why does this happen?
For performance reasons, parsers of XML, JSON, etc. will often return String
instances created directly from the input stream data. It would be far slower to have to compare all of the input to some cache of strings already on hand. The trade-off is that we will use some more memory storing redundant strings.
In the case of Denmark, each of its 2.6 million addresses can be represented as a regular expression of a street, a number, and a postcode-city pair. This comes together to be little more than 60,000 distinct strings. 1
Simply passing on all of the Address
objects to a collection will not optimize this. All of the underlying String
instances will continue to be in use throughout the runtime.
Approach to String Caching
In anticipation of a dataset containing a lot of redundant strings, we can choose to cache distinct strings using a custom collection implementation, or as part a wrapper class for a collection.
For the cache, use a fast Map<Key, Value>
implementation. The Key
will always be String
for comparing to incoming data. The value will either be the distinct String
that we store, or an integer pointing to a String[]
or List<String>
of distinct strings. Store the values instead of the incoming Address
objects, either in arrays or as new Address
instances in an underlying collection.
For Java I recommend extending the abstract type of collection that is to be implemented. This assists with getting collection behavior specific to Java correct. E.g. to develop a new type of List
, extend the AbstractList class.
In any case:
- Recognize that
Address
is a value-based type. The implemented collection should store the value that anAddress
represents. Instances are replaceable so long as the value is retained in the collection. - Values can be represented imperatively within some processes or collections for performance reasons. This applies even if the interface will be object-oriented. Using index numbers instead of pointer references, using arrays to store coordinates, etc. Choosing a programming paradigm is a matter of experience, and will vary for different applications.
-
I found this through experiments on OSM-data during my project. Try confirming this yourself. ↩︎