Introduction
One of the bottlenecks in scaling up Elasticsearch is fielddata. The term fielddata does not refer directly to data; it refers to the data structures and caching that Elasticsearch uses for efficient look-ups. Apart from consuming a considerable chunk of memory, fielddata also affects query performance, and the impact is clearly visible when you have millions of documents to query. One of the recommended solutions is to scale out. But even with multiple nodes, you need to make sure every single node is fine-tuned; otherwise you may have to keep adding a considerable number of nodes as the data set grows. In the following sections we discuss how to optimize fielddata usage, which in turn should help improve memory usage and query performance on a single node.
Use doc_values = true
The main issue with fielddata is that it consumes a huge amount of memory. Setting doc_values = true on a field in the mapping tells Elasticsearch to store fielddata on the file system instead of in memory. However, there is currently one limitation for string fields: only not_analyzed string fields can have this option. This means you cannot use doc_values = true on fields that have an analyzer defined on them, for example one that lower-cases terms. That becomes a problem in some cases, e.g. if you need to provide case-insensitive search on a field. The usual suggestion here is to use a multi-fields mapping, i.e. one analyzed and one not_analyzed field. But the moment we have an analyzed field, we cannot use doc_values = true on it, which means its fielddata will be held in memory. To handle this we need to use a transform script. In the transform script you perform the operations the analyzer would have done, such as converting the field to lower-case, and store the result in a not_analyzed field. If you need a full analyzer, find out how to invoke it explicitly from your transform script.
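As a rough sketch of the mapping described above, the following Python snippet builds an index-creation body with a transform script that writes a lower-cased copy of a field into a second not_analyzed field, with doc_values enabled on both. It assumes the Elasticsearch 1.x mapping syntax (string type, index = not_analyzed, mapping-level transform with Groovy scripting enabled). The type name product and the field names name and name_lower are hypothetical, and the Groovy script is illustrative and has not been verified against a running cluster.

```python
import json


def build_mapping(type_name, field, lowered_field):
    """Build an index-creation body (assumed Elasticsearch 1.x mapping syntax).

    Both fields are not_analyzed strings with doc_values enabled, so their
    fielddata lives on disk. The transform script fills `lowered_field`
    with a lower-cased copy of `field` at index time.
    """
    on_disk_string = {
        "type": "string",
        "index": "not_analyzed",
        "doc_values": True,
    }
    return {
        "mappings": {
            type_name: {
                "transform": {
                    "lang": "groovy",
                    # Illustrative script: `source` and `target` are passed in
                    # as params; ?. skips documents that lack the source field.
                    "script": "ctx._source[target] = ctx._source[source]?.toLowerCase()",
                    "params": {"source": field, "target": lowered_field},
                },
                "properties": {
                    field: dict(on_disk_string),
                    lowered_field: dict(on_disk_string),
                },
            }
        }
    }


# Hypothetical names, used only for illustration.
body = build_mapping("product", "name", "name_lower")
print(json.dumps(body, indent=2))
```

The printed JSON could then be sent as the body of an index-creation request. Case-insensitive search and aggregations would target name_lower, with the search input lower-cased on the client side before querying.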
The documentation says that using doc_values improves memory usage but can cost some performance. In our testing, however, aggregation query performance improved drastically. This may be because fielddata was now being served from the file system cache.
Handle sorting
Sorting is another feature where fielddata is used. When a sort is specified on a particular field, Elasticsearch needs to find the terms for each document, i.e. it needs fielddata. To avoid this we can use script-based sorting. The script can take the field name as a parameter and return the field value from the _source. Even though there is overhead in accessing the source, we observed around a 50% improvement using script-based sorting.
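A minimal sketch of such a search request is shown below, again as a Python dict. It assumes the Elasticsearch 1.x _script sort syntax with Groovy scripting enabled. The helper name build_sorted_search and the field name name are hypothetical.

```python
import json


def build_sorted_search(query, sort_field, order="asc"):
    """Build a search body that sorts on a value read from _source.

    Assumes the Elasticsearch 1.x `_script` sort syntax. The script receives
    the field name as the `field` param and returns that field's value from
    _source, so no fielddata is loaded for the sort.
    """
    return {
        "query": query,
        "sort": [
            {
                "_script": {
                    "lang": "groovy",
                    "script": "_source[field]",
                    "type": "string",  # use "number" for numeric fields
                    "params": {"field": sort_field},
                    "order": order,
                }
            }
        ],
    }


# Hypothetical field name, used only for illustration.
search_body = build_sorted_search({"match_all": {}}, "name", order="desc")
print(json.dumps(search_body, indent=2))
```

Passing the field name as a parameter keeps the script text constant across requests, so the same script can be reused for every sortable field.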
Aggregation
Aggregation is the main feature where fielddata is required. Elasticsearch provides Scripted Metric Aggregation, but using a script did not help here. The best option is to avoid aggregation queries when they are not required. For example, use a script filter if you need to query for distinct documents.