Implementing Full "data ship" Pushdown in JBoss Data Virtualization

Vijay Chintalapati bio photo By Vijay Chintalapati Comment

Full Pushdown in Dependent Joins

Let’s first talk about what a Dependent Join is. Below is an excerpt straight from the Official Documentation

A special optimization called a dependent join is used to reduce the rows returned from one of the two relations involved in a multi-source join. In a dependent join, queries are issued to each source sequentially rather than in parallel, with the results obtained from the first source used to restrict the records returned from the second. Dependent joins can perform some joins much faster by reducing the amount of data retrieved from the second source and the number of join comparisons that must be performed.

Now let’s see what a Full (data-ship) Pushdown is in the context of Dependent Joins, again straight from the documentation :

Full Pushdown - where translator ships the all data from the independent side to the translator. Can be used automatically by costing or can be specified as an option in the hint.

In the context of Dependent Joins in JBoss Data Virtualization, Full Pushdown causes the runtime to first query the independent side of the join (typically the table with fewer rows), extract the data and ship them to dependent side of the join (typically the table with a large dataset) by creating a temp table and then inserting the shipped rows into it. The final join happens locally on the dependent data source, between the temp table and the table with the large dataset.

Click on the image below (courtesy: Ramesh Reddy) for a detailed article on Dependent Joins

When might one need a Full Pushdown ?

A Full Pushdown is needed almost all the time when the following criteria is met:

  • There are joins between two data source tables with disproportionate sizes
  • Either the runtime implicitly determines decides that the join is dependent based on cardinality data of the tables or the join is explicitly marked as one with MAKEDEP hints
  • The join output projects fields from both data source tables
  • (Optional) The join is on more than one column

Implementing Full Pushdown in JBoss Data Virtualization

Let’s chalk up a usecase for implementing pushdown in JDV. Lets say we have the following setup:

  • Two datasources : MySQL and Postgres
  • Similar employees table in both data sources
  • Postgres employees table has only 10 records, whereas the records in MySQL are in several thousands
  • The goal is to return only the matching records in both tables that can be (inner) joined on few column names

The setup can be understood with the following image where a view (orange table) is created using a join on source tables (in blue).

Create a Translator Override

Once a VDB is created which includes all the source models and the view models, the next thing to do is to create translator overrides as shown below.

Focus your attention on the overridden properties that have eraser symbol in them. Since MySQL source table is the dependent side of the join, a translator override with the name pd-mysql5 is created in the Translator Overrides tab of the VDB editor with the following properties overridden:

  • Enable Dependent Joins = true
  • Max number of dependent IN predicates = 1
  • Max number of IN predicate entries = 1

Note: The last two properties are thresholds. Only after exceeding these thresholds the full pushdown is enforced and hence these are deliberately set to a very low value. One must choose appropriate values based on the use case

Apply the Translator Override to Source Model

Once the translator override is in place we can now associate the overridden translator with the actual source model (as shown) before (saving and) deploying the VDB.

So in our use case, we selected the MySQL_Employees.xmi on the left in the VDB editor and associated the translator pd-mysql in the Source Binding Definition tab (shaded in blue) on the right.

Deploy the VDB and Test

Once the VDB is deployed, test your view by issuing query as shown below:

You would see the following lines show up in the teiid-command.log. Notice how a temporary table was created on MySQL on the 6th line below and how a local join is performed in the 8th and 9th line. We achieved the Full Pushdown with this example.

06:08:55,400 INFO  [org.teiid.COMMAND_LOG] (New I/O worker #17) B9WJQ/2I16iF    START USER COMMAND: startTime=2016-12-15 06:08:55.4 requestID=B9WJQ/2I16iF.0    txID=null   sessionID=B9WJQ/2I16iF  applicationName=JDBC    principal=teiidUser@teiid-security  vdbName=pushdown-dvdemo vdbVersion=1    sql=select * from Employees.employees
06:08:55,417 DEBUG [org.teiid.COMMAND_LOG] (Worker10_QueryProcessorQueue51) B9WJQ/2I16iF    START DATA SRC COMMAND: startTime=2016-12-15 06:08:55.417   requestID=B9WJQ/2I16iF.0sourceCommandID=2   executionID=10  txID=null   modelName=PostgreSQL_Employees  translatorName=pd-postgresql    sessionID=B9WJQ/2I16iF  principal=teiidUser@teiid-security  sql=SELECT g_0.emp_no AS c_0, g_0.birth_date AS c_1, g_0.first_name AS c_2 FROM PostgreSQL_Employees.employees AS g_0 ORDER BY c_0, c_1, c_2
06:08:55,417 DEBUG [org.teiid.COMMAND_LOG] (Worker10_QueryProcessorQueue51) B9WJQ/2I16iF    SOURCE SRC COMMAND: endTime=2016-12-15 06:08:55.417 requestID=B9WJQ/2I16iF.0    sourceCommandID=2   executionID=10  txID=null   modelName=PostgreSQL_Employees  translatorName=pd-postgresql    sessionID=B9WJQ/2I16iF  principal=teiidUser@teiid-security  sourceCommand=[SELECT g_0."emp_no" AS c_0, g_0."birth_date" AS c_1, g_0."first_name" AS c_2 FROM "public"."employees" AS g_0 ORDER BY c_0, c_1, c_2]
06:08:55,420 DEBUG [org.teiid.COMMAND_LOG] (Worker9_QueryProcessorQueue52) B9WJQ/2I16iF     END SRC COMMAND:    endTime=2016-12-15 06:08:55.42  requestID=B9WJQ/2I16iF.0    sourceCommandID=2   executionID=10  txID=null   modelName=PostgreSQL_Employees  translatorName=pd-postgresql    sessionID=B9WJQ/2I16iF  principal=teiidUser@teiid-security  finalRowCount=10    cpuTime(ns)=2382271
06:08:55,427 DEBUG [org.teiid.COMMAND_LOG] (Worker10_QueryProcessorQueue53) B9WJQ/2I16iF    START DATA SRC COMMAND: startTime=2016-12-15 06:08:55.427   requestID=B9WJQ/2I16iF.0sourceCommandID=3   executionID=11  txID=null   modelName=MySQL_Employees   translatorName=pd-mysql5    sessionID=B9WJQ/2I16iF  principal=teiidUser@teiid-security  sql=SELECT g_0.emp_no AS c_0, g_0.birth_date AS c_1, g_0.first_name AS c_2, g_0.last_name AS c_3, g_0.gender AS c_4, g_0.hire_date AS c_5 FROM MySQL_Employees.employees AS g_0 WHERE (g_0.emp_no IN (<dependent values>)) AND (g_0.birth_date IN (<dependent values>)) AND (g_0.first_name IN (<dependent values>)) ORDER BY c_0, c_1, c_2
06:08:55,428 DEBUG [org.teiid.COMMAND_LOG] (Worker10_QueryProcessorQueue53) B9WJQ/2I16iF    SOURCE SRC COMMAND: endTime=2016-12-15 06:08:55.428 requestID=B9WJQ/2I16iF.0    sourceCommandID=3   executionID=11  txID=null   modelName=MySQL_Employees   translatorName=pd-mysql5    sessionID=B9WJQ/2I16iF  principal=teiidUser@teiid-security  sourceCommand=[INSERT INTO TEIID_DKJ1 (COL1, COL2, COL3) VALUES (?, ?, ?)]
06:08:55,432 DEBUG [org.teiid.COMMAND_LOG] (Worker10_QueryProcessorQueue53) B9WJQ/2I16iF    SOURCE SRC COMMAND: endTime=2016-12-15 06:08:55.432 requestID=B9WJQ/2I16iF.0    sourceCommandID=3   executionID=11  txID=null   modelName=MySQL_Employees   translatorName=pd-mysql5    sessionID=B9WJQ/2I16iF  principal=teiidUser@teiid-security  sourceCommand=[SELECT g_0.`emp_no` AS c_0, g_0.`birth_date` AS c_1, g_0.`first_name` AS c_2, g_0.`last_name` AS c_3, g_0.`gender` AS c_4, g_0.`hire_date` AS c_5 FROM TEIID_DKJ1, `employees` AS g_0 WHERE g_0.`emp_no` = TEIID_DKJ1.COL1 AND g_0.`birth_date` = TEIID_DKJ1.COL2 AND g_0.`first_name` = TEIID_DKJ1.COL3 ORDER BY c_0, c_1, c_2]
06:08:55,435 DEBUG [org.teiid.COMMAND_LOG] (Worker9_QueryProcessorQueue54) B9WJQ/2I16iF     END SRC COMMAND:    endTime=2016-12-15 06:08:55.435 requestID=B9WJQ/2I16iF.0    sourceCommandID=3   executionID=11  txID=null   modelName=MySQL_Employees   translatorName=pd-mysql5    sessionID=B9WJQ/2I16iF  principal=teiidUser@teiid-security  finalRowCount=10    cpuTime(ns)=6489893
06:08:55,446 INFO  [org.teiid.COMMAND_LOG] (Worker9_QueryProcessorQueue55) B9WJQ/2I16iF     END USER COMMAND:   endTime=2016-12-15 06:08:55.446 requestID=B9WJQ/2I16iF.0    txID=null   sessionID=B9WJQ/2I16iF  principal=teiidUser@teiid-security  vdbName=pushdown-dvdemo vdbVersion=1    finalRowCount=10
comments powered by Disqus