Exploring Partition-wise Join Optimization in/with Partitioned Union

Vijay Chintalapati bio photo By Vijay Chintalapati Comment

Background

In this article we will explore a join optimization , in JBoss Data Virtualization, across tables of various data sources particularly when the join is performed on the union of two sets of tables where each table data participating in union is mutually exclusive from one another. The join optimization will result in conversion of this join of unions to union of joins where the joins are performed locally in each data source.

Too much for a background ? Hang in there and read on …

First, let’s talk about what a partitioned union is and how one can achieve it. To test this out locally, use the source code and provided instructions.

Partitioned Union

When is a union called a Partitioned Union in JBoss Data Virtualization? Again, straight from the Official Documentation :

Union partitioning is inferred from the transformation/inline view. If one (or more) of the UNION columns is defined by constants and/or has WHERE clause IN predicates containing only constants that make each branch mutually exclusive, then the UNION is considered partitioned. UNION ALL must be used and the UNION cannot have a LIMIT, WITH, or ORDER BY clause (although individual branches may use LIMIT, WITH, or ORDER BY). Partitioning values should not be null. The concept of a partitioned union is used for performing partition-wise joins

So, the various datasources participating in the union may as such have mutually exclusive datasets but the runtime engine wouldn’t automatically be aware of such a partitioning and hence the approaches to make it explicit.

Partition-wise Joins

Let’s consider a unique usecase as described below:

  • There are 3 (different) data sources
  • Each data source has the same schema and same/similar set of tables (but just with different names for clarity):
    • Data source 1 has schema S1 with tables x1 and y1, x1 and y1 have a primary key-foriegn key relationship
    • Data source 2 has schema S2 with tables x2 and y2, x2 and y2 have a primary key-foriegn key relationship
    • Data source 3 has schema S3 with tables x3 and y3, x3 and y3 have a primary key-foriegn key relationship
  • There isn’t an overlap of data between all the x tables and y tables respectively
  • The goal is to perform a join between a union of all x tables and union of all y tables as shown below :
  ( x1 UNION x2 UNION x3 ) JOIN ( y1 UNION y2 UNION y3 )

The runtime engine plans execution of such a query in the manner shown below :

  1. Get the data (ACCESS) for each x table and then perform UNION
  2. Get the data (ACCESS) for each y table and then perform UNION
  3. Perform the join between Union’ed X with Union’ed Y data

A graphical representation of the query plan would look something like below :

The above plan is right/correct but it is not optimal for cases where data is partitioned between data sources in such a way that there is no overlap whatsoever between the X table of one datasource to Y table of other two data sources. That is :

x1 JOIN y2 = [ ] (empty)
x1 JOIN y3 = [ ] (empty)

In such unique scenarios, the ideal and optimal outcome to the goal of the usecase above is :

  ( x1 JOIN y1 ) UNION ( x2 JOIN y2 ) UNION ( x3 JOIN y3 ) 

This is exactly what Partitioned Unions lets you do, that is to convert a join of unions to union of joins.

Demonstration

To demonstrate how to achieve the partition-wise join optimization, we will look at a demo where we show a join of unions before applying the optimization and after applying the optimization to clearly highlight the changes one should make to allow the optimization to kick-in.

Consider an account table data and a customer table data partitioned into three different MySQL instances 1, 2 and 3 respectively.

Before Partitioned Union

Account virtual table

Customer virtual table

Customer Account Map virtual table

Execution Plan

After Partitioned Union

Account virtual table

Notice how we are following the principles of Partitioned Union, have introduced a new column by the name Instance and given it a distinct value for each select of the union.

Customer virtual table

As with the Account table we do the same to the Customer table and introduce a new Instance column.

Customer Account Map virtual table

Here is where the magic happens, when we join the Customer and Account virtual tables we make sure we join on the Instance column as well in addition to the SSN column. While doing so, we ensure to suppress the Instance column as one of the projected fields (columns of virtual view table) to confirm to the initial design.

When the query engine notices that join is happening between two datasets for the same Instance value (let’s say 1), it performs the join on those two datasets (customer and account) from the same datasource before participating in a union including other partitions.

Execution Plan

Notice how the execution plan has now changed from join of unions to union of joins. Also, note the highlighted query of one of the joins, also shown below. The join is done all locally on MySQL_Instance_3 which optimizes the query execution resulting in faster fetching of results.

SELECT  g_1.AccountID, 
        g_1.AccountType, 
        g_1.SSN, 
        g_0.FIRSTNAME 
FROM    MySQL_Instance_3.customer AS g_0, 
        MySQL_Instance_3.account AS g_1 
WHERE   g_1.SSN = g_0.SSN
comments powered by Disqus