Knowledge Base (Display) Knowledge Base (Display)

Report data volume

This article adresses a common Birt report design mistake, related to large data sources

Overview

Visioneo reportlet is designed to fit in an existing architecture, without adding any further resources. Your Birt reports must be developed so that they won't struggle portal servers. Build small, agile, highly parameterizable reports is the most important way to achieve that. This would obviously require a whole book and much debates, the article addresses a very common design mistake.

When your reports have to deal with large data sources, there are several ways to manage filters and aggregations. We must distinguish:

  • Data source volume
  • Data volume returned by your SQL / ODA query
  • Data volume returned by your dataset
  • Data volume returned by your report elements (tables, cross-tables, charts)

Note there should not be any correlation between a data source volume and a report volume: for example a report can use a data source with 1 billion rows , and aggregate results in a single output cell. Depending on where report filters and aggregations are applied, this will change everything in your project architecture and performances!

Report sample

Consider the report sample below, a cross-table with about 500 output cells. It is using the default Birt "Classic models" database. But for this article, let's imagine it is based on a 20 millions rows table, aggregated and filtered to produce this result.

Main filters step are:

  • Filter on report parameters: for example Products category, Offices and year Y & Y-1. Let's say once these filters are applied, in our example there are still 1 million rows to process
  • Filter final output on year Y.  The Year Y-1 is included in the filter above to allow time series calculations, but is not displayed 

And main aggregations are:

  • Aggegrate to the report output level (Month, Group, Product Category). In our example (7 categories * 12 months * 2 years) it results in about 200 rows.
  • Time series calculations (Year-to date Y & Y-1, 12 months rolling, ...)
  • Totals (by month, by category)

Following scenarios illustrate where these operations can be processed in a Birt report design.

Scenarios

As you can see, your processor cores are knocked out in scenario 1, out of breath in scenario 2, work pretty well in scenario 3 and enjoy scenario 4.

The central idea is to get rid of volume at the earliest possible stage. Your Database Management System has many indexes, caches, statistics informations, partitions, etc. to optimize and parallelize your filters and aggregations. A birt engine does not have such things, and performances will dramatically decrease with large volumes

Therefore, you should always process large volumes in your report dataset sql statements :

  • Filter in a "where" sql clause using birt report parameters rather than in dataset / cross-tables filters
  • Aggregate in "group by" sql clause rather than in datacube aggregation / table groups. 

Of course there are huge differences between DBMS, for example

  • Oracle can generate transparent aggregates on fact tables,
  • Infobright or Sybase IQ are  columnar databases and will evaluate large queries in a sub-second time  
  • Teradata is the master of intra/inter queries parallelization
  • etc.

But even with a simple DBMS such MySQL, by applying filters and aggregations in sql queries, you will always be able to easily optimize Birt reports by building an aggregate table and indexes, for example. Furthermore, you should always consider using Stored Procedures for large/complexe queries, such described in the article "Stored procedures vs SQL queries".

With this simple principle you won't struggle with performances issues, and will be able to focus on Birt added values features such time series caculations, derived measures, report parameters, formatting etc. No need to have millions rows to be concerned by this article! Even if your reports handle a few thousands rows, you will always take advantage of reducing an upstream volume.

Limit per query

Visioneo reportlet has a parameter to protect your server against too large report datasets. In <reportlet webapp root>/WEB-INF/portlet.xml:

<init-param>

<name>maxRowQuery</name>

<value>20000</value>

</init-param>

This parameter prevents a dataset to receive more than 20000 rows per query. If this limit is exceeded, no exception is thrown: the report is normally processed with the first 20000 rows received. That means with our fictive example above, scenarios 1 & 2 would only process 0,1% data, when scenarios 3 & 4 would work as expected.

You can increase this value,  but if you have carefully read the article this may not be something you want to do! Add a required parameter or aggregations in your dataset sql instead! Remember there is no direct correlation between your data source volume and the volume you really need to extract.

In practice, well-designed reports which absolutely have to get more than 20000 rows are very rare, whatever your company size is. Most of the time, it is required when Birt is used as a data extractor, to produce a csv / excel export to load some local databases. 

Tags: architecture developers
Average (0 Votes)
Most Recent