Synapse Analytics Serverless can’t Resolve this Query
Microsoft Support is unable to determine why Synapse is having difficulty resolving a simple query.
The Query
SELECT A.Date
,COUNT(B.TransactionDate) as TransactionCount
FROM VW_DimDate A LEFT JOIN VW_FactSalesTransaction_v2 B
ON A.Date = B.TransactionDate
WHERE YEAR(A.Date) = 2022
GROUP BY A.DATE
The error message:
Why do we need it?
Please note that this is only a sample query, representing a simple relationship between a Date dimension and a Fact table. PowerBI will be used to consume this dataset.
Instead of the year, any other column in the dimension table can be used as the where clause like Fiscal Year, Fiscal Quarter or whatever. (I apologize for saying that, but that was one of the stupid questions asked by MS support).
A more realistic variation would be like this
SELECT A.Date
,COUNT(B.TransactionDate) as TransactionCount
FROM VW_DimDate A LEFT JOIN VW_FactSalesTransaction_v2 B
ON A.Date = B.TransactionDate
WHERE A.PromotionWeekEndDate BETWEEN '20220601' AND '20220630'
GROUP BY A.DATE
How big is the data set?
Currently, the data set is quite small, approximately 25GB in compressed Parquet files for one client. It can easily exceed 100GB for a bigger client.
Due to the expected data size, we are considering PowerBI direct query instead of the Import model.
To accelerate queries, we may use aggregated Parquet tables created by CTAS, but I will discuss that in a later post. Right now, we are using the RAW transaction data with a granularity level down to Transaction/item.
Background information
- We are using Synapse Serverless Instance.
- Parquet files were created using Synapse Pipelines and a Copy activity from SQL Server to Azure Data Lake Gen2.
- We are using Views, instead of External Tables, to expose the Parquet Files in Synapse as External Tables don’t expose the Partitioning columns.
- The VIEWs are using the best practices, like enforcing all column datatypes.
- Using the
filepath()
function, we apply the correct data types to the partition columns, which are exposed as transaction_year, transaction_month, and transaction_date. - The transactions sales view maps to parquet files partitioned by Year/Month/Date with approximately 2.8 bi-rows and an average of 70 million rows per month.
CREATE VIEW VW_FactSalesTransaction_v2
AS
SELECT
CAST(r.filepath(1) AS SMALLINT) as TransactionYear
,CAST(r.filepath(2) AS TINYINT) as TransactionMonth
,CAST(r.filepath(3) AS DATE) as TransactionDate
,col01
,col02
,col03
,col04
,col05
,col06
,col07
,col08
,col09
,col10
,col11
,col12
,col13
,col14
,col15
,col16
,col17
,col18
,col19
,col20
,col21
FROM OPENROWSET(
BULK 'FactSalesTransaction/transaction_year=*/transaction_month=*/transaction_date=*/*.parquet'
,FORMAT = 'PARQUET'
,DATA_SOURCE = 'my_data_source')
WITH (
col01 bigint
,col02 int
,col03 int
,col04 int
,col05 int
,col06 bigint
,col07 int
,col08 numeric(19,4)
,col09 numeric(19,4)
,col10 numeric(19,4)
,col11 numeric(19,4)
,col12 numeric(19,4)
,col13 numeric(19,4)
,col14 numeric(19,4)
,col15 numeric(19,4)
,col16 datetime2
,col17 datetime2
,col18 numeric(19,4)
,col19 numeric(19,4)
,col20 numeric(19,4)
,col21 numeric(19,4)
)
- The date dimension was also exported from the original DWH and contains many fields that users may consume as report filters in PowerBI.
CREATE VIEW VW_DimDate
AS
SELECT [DateKey]
,[Date]
,[FullDateUK]
,[FullDateUSA]
,[DayOfMonth]
,[DaySuffix]
,[DayName]
,[DayOfWeek]
,[DayOfWeekInMonth]
,[DayOfWeekInYear]
,[DayOfQuarter]
,[DayOfYear]
,[WeekOfMonth]
,[WeekOfQuarter]
,[WeekOfYear]
,[Month]
,[MonthName]
,[MonthOfQuarter]
,[Quarter]
,[QuarterName]
,[Year]
,[YearName]
,[MonthYear]
,[YYYYMM]
,[WeekStartDate]
,[WeekEndDate]
,[AdweekStartDate]
,[AdWeekEndDate]
,[MonthStartDate]
,[MonthEndDate]
,[QuarterStartDate]
,[QuarterEndDate]
,[YearStartDate]
,[YearEndDate]
,[SequentialDay]
,[SequentialWeek]
,[SequentialFortnight]
,[DayOfFortnight]
,[FortNightStartDate]
,[FortNightEndDate]
,[AuditCreated]
,[AuditLastUpdated]
,[AuditCheckSum]
,[UnixTimeStamp]
,[PromotionWeekStartDate]
,[PromotionWeekEndDate]
,[WeekStartDatePreviousYear]
FROM
OPENROWSET(
BULK 'DimDate/**'
,FORMAT = 'PARQUET'
,DATA_SOURCE = 'my_data_source )
WITH (
[DateKey] int
,[Date] date
,[FullDateUK] varchar(10)
,[FullDateUSA] varchar(10)
,[DayOfMonth] tinyint
,[DaySuffix] char(10)
,[DayName] varchar(15)
,[DayOfWeek] tinyint
,[DayOfWeekInMonth] tinyint
,[DayOfWeekInYear] tinyint
,[DayOfQuarter] tinyint
,[DayOfYear] smallint
,[WeekOfMonth] tinyint
,[WeekOfQuarter] tinyint
,[WeekOfYear] tinyint
,[Month] tinyint
,[MonthName] varchar(15)
,[MonthOfQuarter] tinyint
,[Quarter] tinyint
,[QuarterName] varchar(10)
,[Year] smallint
,[YearName] varchar(10)
,[MonthYear] varchar(10)
,[YYYYMM] int
,[WeekStartDate] date
,[WeekEndDate] date
,[AdweekStartDate] date
,[AdWeekEndDate] date
,[MonthStartDate] date
,[MonthEndDate] date
,[QuarterStartDate] date
,[QuarterEndDate] date
,[YearStartDate] date
,[YearEndDate] date
,[SequentialDay] smallint
,[SequentialWeek] smallint
,[SequentialFortnight] smallint
,[DayOfFortnight] tinyint
,[FortNightStartDate] date
,[FortNightEndDate] date
,[AuditCreated] date
,[AuditLastUpdated] date
,[AuditCheckSum] varbinary(64)
,[UnixTimeStamp] int
,[PromotionWeekStartDate] date
,[PromotionWeekEndDate] date
,[WeekStartDatePreviousYear] date) AS r
My first approach was to reduce the number of columns in the Data Dimension in order to determine if any of these columns were causing the problem. CTAS was the easiest way to create a new Parquet file. (Lazy me!)
CREATE EXTERNAL TABLE CTAS_Dates
WITH (
LOCATION = 'aggregated_data/CTAS_Dates',
DATA_SOURCE = my_data_source,
FILE_FORMAT = parquet_file_format)
AS
SELECT CAST("Date" AS Date) AS Date
,Year(Date) AS Year
,Month(Date) AS Month
FROM VW_DimDate AS D
GO
However, the error persists.
Microsoft Support Suggestions
Microsoft Support suggested some “improvements” in my queries that I decided to implement even though sometimes I didn’t agree with them.
Suggestion A — Do not use functions in the Where Clause
“Using functions in the where clause is not recommended, as it can affect how the SQL optimiser estimates the cost of the query. We noticed that the query seems to be scanning all years (not just 2022)”
This seems relevant and I have to admit I didn’t know about it, so let's try:
Suggestion B — Modify the query in such a way that it should filter table A and B on Year 2022 before LEFT JOIN. Which will only refer to the specific data of year 2022 before performing left join.
This one seems quite silly to me. As I mentioned before, this query is just a sample, and real queries may not be so easy to filter before joining (and also make little sense).
In order to implement this design, we need to use common table expressions and then join the results.
Additionally, this approach does not reflect the natural behavior of users.
The situation becomes more bizarre
OK, let’s start over and, for a moment, admit that using a where clause does not make sense since I can filter directly based on the transaction_year field from the fact table, so, let’s do that.
Using the Fact Table Partition Key
Great! It works, just fine. Let’s now add a scenario from the real world.
Fact table Partition Key and a Filter from the Dimension Table
Okay, that works too!
But, That’s not Natural
Let’s remove the Partition column from the query once again. In general, people do not write queries in this manner, rather they write them in the following manner:
How About Power BI
But wait a moment! Power BI won’t write queries like the ones above. Using the Performance Analyser, we can extract exact the query it is sending back to the Synapse.
It's more likely it will be something like this: (extracted from the report)
// Direct QuerySELECT
COUNT_BIG([t0].[SalesTransactionKey])
AS [a0]
FROM
((
select [$Table].[TransactionYear] as [TransactionYear],
[$Table].[TransactionMonth] as [TransactionMonth],
[$Table].[TransactionDate] as [TransactionDate],
[$Table].[col01],
[$Table].[col02],
...
[$Table].[coln]
from [dbo].[VW_FactSalesTransaction_v2] as [$Table]
) AS [t0]LEFT OUTER JOIN(
select [$Table].[DateKey] as [DateKey],
[$Table].[Date] as [Date],
[$Table].[FullDateUK] as [FullDateUK],
...
[$Table].[WeekStartDatePreviousYear] as [WeekStartDatePreviousYear]
from [dbo].[VW_DimDate] as [$Table]
) AS [t1] on
(
[t0].[TransactionDate] = [t1].[Date]
)
)WHERE
(
([t1].[PromotionWeekEndDate] < CAST( '20220811 00:00:00' AS datetime))
AND
([t1].[PromotionWeekEndDate] >= CAST( '20220418 00:00:00' AS datetime))
)
I was surprised to find that this also worked!
How it Goes with Other Dimensions
Let’s use a Fake Store dimension.
Conclusion
Microsoft Support is still investigating the cause of this behavior at the time of this writing.
My humble opinion is that there is nothing wrong with the query.
It appears to be an issue when using the partition columns in a join. The Date Dimension and the Fact Table are connected by the “Fact.Transaction_Date = DimDate.Date” columns, which is the lowest level of partition keys.
Dim Date is not partitioned of course.
In Dim Stores, there is no partition, and it is joined to storeKey in the fact table, which is also not a partition.