Efficient Way to handle deep paging in Solr 4.7

One thing Solr has never been very efficient at is a problem that people refer to as “Deep Paging”.

What is Deep Paging:
Imagine the following problem: we have an application that expects Solr to return results sorted by some field or by score. Those results will then be paged in the GUI.

However, if the person using the GUI immediately jumps to the tenth, twentieth, or fiftieth page of search results, there is a problem: the wait time grows, because Solr has to do a lot of work to produce that page.

Asking Solr for “Page #1” of a search result is very efficient, and so is asking for Page #2, Page #3, and other small page numbers. The problem begins as page numbers get bigger and bigger: Solr has to work harder (and use more RAM) in order to know what results to give you.

To return search results Solr must prepare an in-memory structure and return part of it. Returning part of the structure is simple if that part comes from the beginning of the structure. However, if we want to return page number 10,000 (with 20 results per page), Solr needs to prepare a structure containing at least 200,000 elements (10,000 * 20). That costs not only time but also memory.

The key to understanding why so much RAM and time is required for “Deep Paging” is to remember that as far as client requests go, Solr is basically stateless. The only way Solr knows which set of records to return next is through the start and rows parameters sent by the client.

A client asks for “pages” by telling Solr how many results it wants on a page (using the rows parameter) and at what position in the overall sorted list of documents the current page should start (using the start parameter).

So for a client that wants 50 results per page, page #1 is requested using start=0&rows=50. Page #2 is start=50&rows=50, page #3 is start=100&rows=50, etc. But in order for Solr to know which 50 docs to return starting at an arbitrary point N, it needs to build up an internal queue of the first N+50 sorted documents matching the query, so that it can then throw away the first N docs, and return the remaining 50. This means the amount of memory needed to return paginated results grows linearly with the increase in the start param.
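
To make that cost concrete, here is a minimal Python sketch of the idea described above. It is an illustration only, not Solr's actual implementation: the helper names page_to_start and top_page are hypothetical, and plain integers stand in for sort values.

```python
import heapq

def page_to_start(page, rows):
    """Translate a 1-based page number into a start offset."""
    return (page - 1) * rows

def top_page(sort_values, start, rows):
    """Return (page, queue_size) for one start/rows request.

    Illustration only: a queue of the first start+rows sorted values
    is built, then the first `start` entries are thrown away.
    """
    queue = heapq.nsmallest(start + rows, sort_values)  # ascending sort
    return queue[start:start + rows], len(queue)

docs = list(range(100_000, 0, -1))  # 100,000 fake sort values, stored unsorted
for page in (1, 10, 1000):
    start = page_to_start(page, 50)
    results, queue_size = top_page(docs, start, 50)
    print("page", page, "start", start, "queue size", queue_size)
```

The queue size printed for each page is always start + rows, so the work grows with the page number even though only 50 results are returned each time.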

In the case of SolrCloud, the problem gets even worse, because the N+50 top documents need to be collected on every shard, and the sort values from every shard need to be streamed over the network to the coordination node (the one that received the initial request from the end client) to merge them.
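
The distributed case can be sketched the same way. The following Python snippet is a simplified model, assuming every shard holds plain integer sort values; shard_top and distributed_page are hypothetical names, and the real SolrCloud merge logic is more involved.

```python
import heapq
import itertools

def shard_top(shard_values, start, rows):
    """Each shard has to return its own best start+rows sort values."""
    return heapq.nsmallest(start + rows, shard_values)

def distributed_page(shards, start, rows):
    """Merge per-shard candidates on the coordinating node.

    Returns (page, values_transferred). Illustration only.
    """
    candidates = [shard_top(values, start, rows) for values in shards]
    transferred = sum(len(c) for c in candidates)
    merged = heapq.merge(*candidates)  # each candidate list is already sorted
    page = list(itertools.islice(merged, start, start + rows))
    return page, transferred

# Four fake shards holding the values 0..39999, distributed round-robin.
shards = [list(range(i, 40_000, 4)) for i in range(4)]
page, transferred = distributed_page(shards, start=1000, rows=50)
print(len(page), transferred)
```

In this model the number of sort values sent to the coordinating node is the number of shards multiplied by start + rows, which is why deep pages hurt more as the cluster grows.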

There are a few workarounds:
1. We can try to enlarge the query result cache or increase queryResultWindowSize, but it is hard to choose the right size: the cache may turn out to be too small, or the entry we need may not be in Solr’s memory, and then the wait for the n-th page of results will still be very long. We can also try adding warming queries, but we cannot prepare every combination, and even if we could, the cache would have to be very big. So neither of these approaches gives the desired results.

2. Use an fq (filter query) with a range query built from the sort field value of the last document retrieved. For example, consider the following query params:

curl 'localhost:8983/solr/select?q=foo&fl=id,name,score&sort=id+asc&start=0&rows=500'

Assuming id is unique for every document, if the last document returned by that query has an id of AAA123, the next “page” of results can be fetched by modifying the query to include an fq on the id field:

curl -g 'localhost:8983/solr/select?q=foo&fl=id,name,score&sort=id+asc&fq=id:{AAA123+TO+*]&start=0&rows=500'
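
A client would normally build these parameters in code rather than by hand. Below is a minimal Python sketch that produces the query string for each page from the last id seen. The function name next_page_params is hypothetical, and the sketch assumes ids contain no characters that need escaping in Solr query syntax.

```python
from urllib.parse import urlencode

def next_page_params(last_id=None, rows=500):
    """Build the query string for the next page of the fq-based approach.

    last_id is the id of the last document on the previous page,
    or None when requesting the first page.
    """
    params = {
        "q": "foo",
        "fl": "id,name,score",
        "sort": "id asc",
        "start": 0,  # always 0: the filter query does the skipping
        "rows": rows,
    }
    if last_id is not None:
        # Exclusive lower bound, so the last document is not repeated.
        params["fq"] = "id:{%s TO *]" % last_id
    return urlencode(params)

print(next_page_params())
print(next_page_params("AAA123"))
```

Because urlencode escapes the spaces, braces, and brackets, the resulting string can be appended to the select URL as-is.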

Deep Paging: http://solr.pl/en/2011/07/18/deep-paging-problem/

Deep paging solution in Solr 4.7:

The good news is that with the release of Solr 4.7 the situation has changed: the cursor has been introduced. A cursor is a logical structure that doesn’t require its state to be stored on the server side. The cursor contains information about the sorting and the last document returned in the results. Because of that, Solr doesn’t need to start the search from the beginning each time we want to get the next page of results. This gives a drastic performance improvement when going deep into the results.

Cursor usage is very simple. To tell Solr to return a cursor, we pass an additional parameter in the first query: cursorMark=*. In the response, apart from the documents, we will get a cursor identifier returned in the nextCursorMark element. Let’s look at an example.

Let’s start with a very simple query:

curl 'localhost:8983/solr/select?q=*:*&rows=1&sort=score+desc,id+asc&cursorMark=*'

There are four things here that we are interested in. First off, we either omit the start parameter or set it to 0. The rows parameter can take whatever value we need; there is no limitation on it. We also passed the cursorMark=* parameter to tell Solr that we want the cursor to be used. The final thing is the sort definition. A cursor needs an explicit sort that tells it how to behave, and that sort has to include the document identifier so that every document has an unambiguous position. That’s why we overrode the default sorting and sorted not only by score, but also by document identifier.
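
These requirements are easy to check on the client before a request is sent. The following Python sketch is a hypothetical helper, not part of any Solr client library; it assumes the unique key field is called id and that the sort contains no function queries with commas in them.

```python
def check_cursor_params(params, unique_key="id"):
    """Return a list of problems that would prevent cursor use."""
    problems = []
    if int(params.get("start", 0)) != 0:
        problems.append("start must be omitted or 0")
    sort = params.get("sort", "")
    # "score desc,id asc" -> ["score", "id"]
    sort_fields = [clause.split()[0] for clause in sort.split(",") if clause.strip()]
    if unique_key not in sort_fields:
        problems.append("sort must include the %s field" % unique_key)
    if "cursorMark" not in params:
        problems.append("cursorMark is missing")
    return problems

good = {"q": "*:*", "rows": 1, "sort": "score desc,id asc", "cursorMark": "*"}
bad = {"q": "*:*", "rows": 1, "sort": "score desc", "start": 10, "cursorMark": "*"}
print(check_cursor_params(good))
print(check_cursor_params(bad))
```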

We will get the cursor identifier in the nextCursorMark section of the response, and we can use it in the following queries. The next query looks like this:
curl 'localhost:8983/solr/select?q=*:*&rows=1&sort=score+desc,id+asc&cursorMark=AoIIP4AAACgwNTc5QjAwMg=='

The logic for further queries is simple: we pass the cursorMark parameter with the value returned by the previous search. So our next query would look as follows:
curl 'localhost:8983/solr/select?q=*:*&rows=1&sort=score+desc,id+asc&cursorMark=AoIIP4AAACoxMDAtNDM1ODA1'
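
The same loop can be written in client code. Here is a minimal Python sketch, assuming a hypothetical fetch(cursor_mark, rows) callable that performs one Solr request and returns the documents together with the nextCursorMark value. It relies on Solr returning a nextCursorMark equal to the cursorMark that was sent once there are no more results. The make_fake_fetch helper is only an in-memory stand-in so the example runs without a server; real cursor marks are opaque strings, not document ids.

```python
def iterate_cursor(fetch, rows):
    """Yield every document by following nextCursorMark.

    fetch(cursor_mark, rows) is a hypothetical callable that performs
    one request and returns (docs, next_cursor_mark).
    """
    cursor_mark = "*"
    while True:
        docs, next_mark = fetch(cursor_mark, rows)
        yield from docs
        if next_mark == cursor_mark:  # unchanged mark: nothing left
            break
        cursor_mark = next_mark

def make_fake_fetch(ids):
    """In-memory stand-in for Solr: the mark is the last id returned."""
    ordered = sorted(ids)

    def fetch(cursor_mark, rows):
        if cursor_mark == "*":
            remaining = ordered
        else:
            remaining = [i for i in ordered if i > cursor_mark]
        docs = remaining[:rows]
        next_mark = docs[-1] if docs else cursor_mark
        return docs, next_mark

    return fetch

fetch = make_fake_fetch(["d3", "d1", "d5", "d2", "d4"])
all_ids = list(iterate_cursor(fetch, rows=2))
print(all_ids)
```

To use it against a real server, fetch would send q, sort, rows, and cursorMark to the select handler and read nextCursorMark from the response.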