Wednesday, April 22, 2015

Taming the fielddata

Introduction

One of the bottlenecks in scaling up Elasticsearch is fielddata. The term fielddata does not refer to the data itself but to the data structures and caches that Elasticsearch builds for efficient look-ups. Apart from consuming a considerable chunk of memory, fielddata also affects query performance, and the impact is clearly visible when you have millions of documents to query. One of the recommended solutions is to scale out. But even with multiple nodes, you need to make sure every single node is fine-tuned; otherwise you may have to keep adding a considerable number of nodes as the data set grows. In the following sections we discuss how to optimize fielddata usage, which in turn should improve memory usage and query performance on a single node.

Use doc_values = true

The main issue with fielddata is that it consumes a huge amount of memory. Setting doc_values = true on a field in the mapping tells Elasticsearch to store that field's fielddata on the file system instead of in memory. However, there is currently one limitation for string fields: only not_analyzed string fields can have this option. This means you cannot use doc_values = true on fields that have analyzers (lower-case etc.) defined on them. That becomes a problem in some cases, e.g. if you need to provide case-insensitive search on a field. The usually suggested approach is a multi-field mapping, i.e. one analyzed and one not_analyzed field. But as soon as we have an analyzed field we cannot use doc_values = true on it, so its fielddata stays in memory. To handle this we need a transform script. In the transform script you do the operations the analyzer would have done, such as converting the field to lower-case, and index the result as a not_analyzed field. If you need a full analyzer, find out how you can invoke it explicitly from your transform script.
While the documentation says doc_values improve memory usage, it also warns that they can hurt performance. In our testing, however, aggregation query performance improved drastically. Maybe this was because fielddata was now being served from the file system cache.
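
As a rough illustration of the approach above, the sketch below builds a mapping body in Python. It assumes an Elasticsearch 1.x cluster with Groovy scripting enabled; the type name "product", the field "name", the derived field "name_lower" and the helper build_mapping are all made up for this example. The transform script copies a lower-cased version of the field into a second not_analyzed field, so both fields can keep doc_values = true.

```python
import json


def build_mapping(type_name, field):
    """Build a mapping with a lower-cased copy of `field` (hypothetical helper).

    Both fields are not_analyzed, so both may use doc_values.
    """
    lowered = field + "_lower"  # illustrative name for the derived field
    # Assumed Groovy transform script: runs at index time and adds the
    # lower-cased copy; `?.` skips documents where the field is missing.
    script = "ctx._source['%s'] = ctx._source['%s']?.toLowerCase()" % (lowered, field)
    return {
        type_name: {
            "transform": {"script": script, "lang": "groovy"},
            "properties": {
                field: {"type": "string", "index": "not_analyzed", "doc_values": True},
                lowered: {"type": "string", "index": "not_analyzed", "doc_values": True},
            },
        }
    }


mapping = build_mapping("product", "name")
print(json.dumps(mapping, indent=2))
```

The printed JSON would be sent as the body of a put-mapping request. Case-insensitive queries then lower-case the search term on the client side and run it against name_lower.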

Handle sorting

Sorting is another feature that uses fielddata. When a sort is specified on a particular field, Elasticsearch needs to look up the terms for each document, i.e. it needs fielddata. To avoid this we can use script-based sorting. The script can take the field name as a parameter and return the field value from the _source. Even though there is overhead in accessing the source, we observed around 50% improvement using script-based sorting.
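
A minimal sketch of such a request body, again in Python and assuming the Elasticsearch 1.x "_script" sort syntax with dynamic scripting enabled. The helper build_source_sort and the field "name" are hypothetical, and the exact script expression may need adjusting for your scripting language.

```python
import json


def build_source_sort(field, order="asc", value_type="string"):
    """Build a script-based sort clause (hypothetical helper).

    The script reads the value from _source, so no fielddata is loaded
    for the sorted field.
    """
    return {
        "_script": {
            "script": "_source[field]",  # `field` is bound via params below
            "type": value_type,          # "string" or "number"
            "params": {"field": field},
            "order": order,
        }
    }


body = {"query": {"match_all": {}}, "sort": build_source_sort("name")}
print(json.dumps(body, indent=2))
```

Passing the field name as a parameter keeps the script text identical across requests, so the same script can be reused for any sortable field.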


Aggregation

Aggregation is the main feature that requires fielddata. Elasticsearch provides the Scripted Metric Aggregation, but using a script did not help here. The best option is to avoid aggregation queries when they are not required. For example, use a script filter if you need to query for distinct documents.