Performance (QueryAPI)
Select Only the Necessary Columns
The more columns included in the query results, the larger the volume of data that must be read from storage, transmitted over the network from the database, and loaded into the application's memory. Therefore, it is essential to specify only the columns that are truly needed for your work.
For instance, if you need to work with 6 columns from a table that has a total of 30 columns, failing to specify which columns to retrieve means that approximately 80% of the data (depending on the individual column lengths) transmitted over the network and loaded into memory will be unnecessary, consuming both memory and network resources.
This issue is exacerbated when multiple users or numerous line items perform such queries repeatedly, potentially leading to server overload or crashes. The risk is even greater when executing JOINs, as additional columns are added to the results. Therefore, it is crucial to always specify the columns to be retrieved.
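The difference is easy to see in a small sketch. The snippet below uses Python's built-in `sqlite3` module and a hypothetical `items` table purely for illustration; the principle applies to any database and query API.

```python
import sqlite3

# Hypothetical "items" table, used only to illustrate the point.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT, price REAL, descr TEXT)")
conn.execute("INSERT INTO items VALUES (1, 'bolt', 0.10, 'M6 bolt, zinc-plated')")

# Bad: retrieves every column, including ones the caller never uses.
wide_row = conn.execute("SELECT * FROM items").fetchone()

# Good: name only the columns you actually need.
narrow_row = conn.execute("SELECT id, name FROM items").fetchone()

print(len(wide_row), len(narrow_row))   # → 4 2
```

Every column dropped from the select list is data that never leaves the database server, never crosses the network, and never occupies application memory.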
Join Criteria on All Key Fields
Joining tables can significantly speed up gathering the data needed for a calculation, but only if the database can use an index to look up that data. Therefore, it is strongly recommended to specify all of the key fields, or at least a leading majority of them (left to right), in the join criteria. Remember that the order of the key fields in an index matters: searching only by firstname in a phonebook indexed on [surname, firstname] causes a full table scan that will significantly prolong the operation.
Sorting
If you need to sort a large data set, sort on an indexed field; otherwise the operation will take a very long time.
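The cost of an unindexed sort shows up directly in the query plan. The sketch below (Python stdlib `sqlite3`, invented `orders` table) compares the same ORDER BY before and after creating an index on the sort key; exact plan wording varies by database version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, created TEXT)")

def plan(sql):
    # Concatenate the "detail" column of each query-plan row.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Without an index, the database must sort explicitly (temporary B-tree).
before = plan("SELECT id FROM orders ORDER BY created")

conn.execute("CREATE INDEX idx_created ON orders (created)")

# With an index on the sort key, the rows come back already ordered.
after = plan("SELECT id FROM orders ORDER BY created")

print(before)  # e.g. SCAN orders ... USE TEMP B-TREE FOR ORDER BY
print(after)   # e.g. SCAN orders USING INDEX idx_created
```

On a table with millions of rows, the explicit sort step in the first plan is what makes the query slow; the index lets the database skip it entirely.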
Fast Processing of Streamed Rows
Fetching rows from a table blocks the database connection. Therefore, the processing of each row should be as fast as possible.
Avoid Loading Large Volumes of Rows into Memory
Collecting a large number of rows into memory using collect { it } for later processing can lead to an out-of-memory exception. This may seem counterintuitive, as it contradicts the previous recommendation to process rows as quickly as possible. It may appear that loading all rows into memory and then processing them would be the fastest approach. However, this is not always the case.
While reading a small number of rows from a small table may work efficiently, the same code can fail dramatically in a production (PROD) environment where the table contains millions of rows. What works well on a quality assurance (QA) setup, where the table may only hold a few rows, can lead to issues in a live environment.
Additionally, loading more data into memory will slow down performance, as the Java Virtual Machine's Garbage Collector must work increasingly harder to free up memory.
So, what is the solution? Process rows in batches. Load only a fixed number of rows into memory (experiment with different settings to determine an optimal quantity), process these rows, clear the list or map into which they were read, and then proceed to the next batch. This approach helps manage memory usage effectively while maintaining performance.
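The batching pattern can be sketched as follows. This example uses Python's stdlib `sqlite3` and an invented `events` table; the batch size of 1,000 is an arbitrary starting point that you would tune experimentally, as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(10_000)])

BATCH_SIZE = 1_000   # tune experimentally for your table and row width
processed = 0

cur = conn.execute("SELECT id FROM events")
while True:
    batch = cur.fetchmany(BATCH_SIZE)   # at most BATCH_SIZE rows held in memory
    if not batch:
        break
    for (event_id,) in batch:
        processed += 1                  # per-row work goes here
    # "batch" is replaced on the next iteration, so the garbage
    # collector can reclaim the previous batch before more rows arrive

print(processed)   # → 10000
```

Peak memory usage is bounded by the batch size rather than the table size, so the same code behaves identically on a small QA table and on a production table with millions of rows.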