Why is Netezza not very popular

Business Intelligence: The solution is the goal

The idea is tempting, especially when requirements become complex: why struggle with a multitude of external sources, experts, or “tips & tricks” when you can buy almost all of it in one specialized system?

Such a solution is called an appliance, and appliances are now offered and used in many parts of the IT industry. Whether in the consumer market, with appliances for listening to music (the iPod) or for the home WLAN (the FritzBox, for example), or in the professional sector: specialized systems put the solution in the foreground rather than the path to it. Appliances contain not only all the standard technical components for the intended application, but often also special software and hardware that has been developed or refined for it.

Netezza becomes PureData for Analytics

A few years ago, for example, IBM acquired Netezza, the manufacturer of the first real data warehouse appliance, and has continued the product line. At the same time, IBM was already developing appliances for other subject areas; the latest development is IBM PureSystems, a new product family of Expert Integrated Systems, which was introduced in April. In October 2012 the Netezza appliance was renamed IBM PureData for Analytics - powered by Netezza. The name change is combined with a complete software upgrade, which will again bring speed advantages by a factor of around 5 to 10.

What makes the Netezza appliance special is the combination of preconfigured, performance-optimized, massively parallel hardware with specially developed hardware modules. This enables significantly better data throughput than comparable systems. In addition, the software optimized for this hardware - operating system and database - simplifies building, deploying, and maintaining projects on the platform. The goal is clear: when developing a data warehouse (DWH) or business intelligence (BI) project, developers should not focus primarily on the performance features of the database, but should be able to take the performance available within a given configuration as a given. This brings the subject-matter content to the fore: self-imposed mental blocks about possible implications for database internals such as indices, summary tables, disk throughput, and much more disappear.

Why performance is so important

In almost all DWH or BI projects today there are discussions about speed, whether for loading or for querying. But why? What makes these systems so special? One aspect: unlike in transaction systems, in which many users, sometimes thousands, each move very small amounts of data, in a DWH it is often the other way around: fewer users move large amounts of data. An example of a typical query: "Total sales for the current and previous year, and the trend across all customers over the last 12 months, for product group xy, broken down by sales channel". That is why data throughput is an extremely important control variable for DWH or BI systems. The reasons for poor performance in these systems are manifold, but not insoluble. Non-appliance approaches with traditional databases offer countless configuration options, various combinations of hardware and software, and dozens of components: everything is possible, including good performance, of course. But the effort is enormous: projects have already found that Netezza SQL needs around 80 percent less object code than traditional SQL, such as that used with Oracle. The additional objects required there, such as indices and instructions on how and where data should be stored, mean considerably more objects that need to be created, maintained, and further developed.
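
To make the shape of such a query concrete, here is a minimal sketch in Python that uses the standard-library sqlite3 module as a stand-in for a warehouse database. The table name sales, its columns (sale_year, product_group, channel, amount), and the sample rows are hypothetical and chosen purely for illustration; the point is only that a single query aggregates over many rows rather than touching one record.

```python
import sqlite3


def sales_by_channel(conn, product_group, current_year):
    """Return (channel, current-year sales, previous-year sales, change) rows."""
    sql = """
        SELECT channel,
               SUM(CASE WHEN sale_year = :cy THEN amount ELSE 0 END) AS current_sales,
               SUM(CASE WHEN sale_year = :py THEN amount ELSE 0 END) AS previous_sales
        FROM sales
        WHERE product_group = :pg
          AND sale_year IN (:cy, :py)
        GROUP BY channel
        ORDER BY channel
    """
    params = {"cy": current_year, "py": current_year - 1, "pg": product_group}
    # Compute the year-over-year change in Python to keep the SQL short.
    return [(ch, cur, prev, cur - prev) for ch, cur, prev in conn.execute(sql, params)]


def build_demo_db():
    """Create an in-memory database with a few made-up sales rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "sale_year INTEGER, product_group TEXT, channel TEXT, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [
            (2012, "xy", "online", 120),
            (2012, "xy", "retail", 200),
            (2011, "xy", "online", 100),
            (2011, "xy", "retail", 250),
            (2012, "ab", "online", 999),  # other product group, filtered out
        ],
    )
    return conn


if __name__ == "__main__":
    for row in sales_by_channel(build_demo_db(), "xy", 2012):
        print(row)
```

In a real DWH the same pattern would scan millions of rows per product group, which is why throughput rather than single-row latency dominates.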

Simple projects are good projects

DWH or BI projects already tend to have a certain inherent complexity in their subject matter. Different sources and views of the data have to be standardized, often across departments and areas of responsibility. The limitations of traditional approaches with conventional technologies add further complexity to the project that could easily be avoided.

Complexity that arises only from expected or actual performance problems:

  • Additional indices accelerate access, but slow down or complicate the loading process. In addition, they must be maintained and often benefit only certain queries. It is an art to define the right indices without creating too much overhead.
    In principle, no indices are used in Netezza projects. What significantly influences performance is the distribution of the data across the parallel processing units, which is determined once at the beginning of the project based on the project requirements and the data, and then occasionally refined. This optimization has no effect on storage volume, administrative complexity, or ETL speed.
  • Data marts are meant for subject-oriented modeling of the data. Often, however, they merely reduce or aggregate the data so that users or BI tools can access it faster and more easily. Data marts consume a lot of storage space, and they complicate and lengthen the actual loading and integration process by another level. In some DWH projects, the data marts use more storage space or more preparation time than the actual DWH.
    In Netezza projects, physical data marts can often be dispensed with and virtualized instead. Netezza is able to filter a large amount of data efficiently within a few seconds, so that reporting is possible directly on an ER model / foundation layer in the DWH.
  • Additional “cubes”, i.e. storage in multidimensional databases, are placed on top of the data marts to further improve performance: another waste of storage space and processing time.
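
The idea of a virtualized data mart from the second point can be sketched as follows. This is an illustration only, again using Python's sqlite3 module in place of the appliance; the table dwh_sales, the view mart_sales_by_group, and the sample rows are hypothetical. Instead of loading an aggregated copy of the data, the mart is defined as a view over the foundation layer, so it needs no extra storage and no additional load step. The approach assumes the database is fast enough to aggregate at query time.

```python
import sqlite3


def create_foundation_layer(conn):
    """Create the detailed DWH table and load a few made-up rows."""
    conn.execute(
        "CREATE TABLE dwh_sales (product_group TEXT, channel TEXT, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO dwh_sales VALUES (?, ?, ?)",
        [("xy", "online", 100), ("xy", "retail", 200), ("ab", "online", 50)],
    )


def create_virtual_mart(conn):
    """Define the data mart as a view: no copy of the data is stored."""
    conn.execute(
        """
        CREATE VIEW mart_sales_by_group AS
        SELECT product_group,
               SUM(amount) AS total_amount,
               COUNT(*)    AS sale_count
        FROM dwh_sales
        GROUP BY product_group
        """
    )


def read_mart(conn):
    """Query the mart the way a BI tool would query a physical table."""
    return conn.execute(
        "SELECT product_group, total_amount, sale_count "
        "FROM mart_sales_by_group ORDER BY product_group"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_foundation_layer(conn)
    create_virtual_mart(conn)
    print(read_mart(conn))
    # New detail rows are visible in the mart immediately, without a reload.
    conn.execute("INSERT INTO dwh_sales VALUES ('ab', 'retail', 25)")
    print(read_mart(conn))
```

The trade-off is that every read of the view recomputes the aggregation, which is only practical when scanning the underlying table is cheap.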

More intelligence meets more power

In modern BI systems, standard reporting is far from the end of the story: ad-hoc analysis, statistics, data mining, and geographic analyses complement and expand the possibilities available to users today. Modern tools and programming languages bring functions ranging from statistics to data mining to the user. Netezza recognized this trend early on: Netezza Analytics, a free extension of the feature set, integrates its own and third-party analytics. The same parallel technology is used to compute ready-made data mining logic, spatial analysis, and statistics. In real projects, data mining can be carried out on the entire data set, statistical methods calculate purchase probabilities in a few milliseconds, and geographic calculations can be used to evaluate the effects of volatile weather on the insurance industry in seconds.

These predefined methods and interfaces are supplemented by an open programming interface with which additional logic can be created, for example in R, C++, and other languages. Research institutes use this, among other things, to determine connections between factors for hereditary diseases.

Convincing advantages

Complex DWH projects can be a thing of the past: thanks to the speed and simplicity of PureData / Netezza appliances, considerable savings can be achieved in projects in both the short and the long term through simple installation, minimized tuning, administration, and maintenance costs, simpler architectures and, last but not least, excellent performance.