Part 2: SQL Server Parallel Data Warehouse – Best Thing Since Sliced Bread?

Read Part 1

Appliance or Bunch of Packaged Components and Ease of Use

It’s easy to call something you can order with a single product number an appliance or integrated system. Certainly, prepackaging all the components together and taking the work of designing a balanced architecture and initial configuration out of the equation shortens time to value and deployment. The other side of the coin is the ongoing effort for administration and new development. PureData System for Analytics, or Netezza, was built from the ground up as an integrated hardware-plus-software solution for optimal analytic performance without extensive tuning work. It uses a unique Asymmetric Massively Parallel Processing architecture in which a dedicated FPGA layer filters out 95–98% of table data, keeping only the data needed to answer a specific query – you can find more detail on the architecture here.

PDW, as I mentioned, was born out of some combination of DATAllegro and SQL Server. Which elements of each technology are used where in PDW is not clear to me, though Microsoft papers speak of SQL Server running on each of the nodes. So what you have is an MPP variant of SQL Server running on HP or Dell hardware densely packed with processing resources. SQL Server, as I recall, does involve administration and performance tuning, and moving to a shared-nothing architecture can dramatically increase complexity. I understand that if you use PDW’s columnstore indexes you don’t need to deal with traditional index creation, but I’m not convinced this translates into zero-to-minimal tuning. And you can’t get a good sense of PDW administration or tuning tasks from the outside, because the PDW documentation and community are accessible only to PDW customers (wonder why…).

Performance, Concurrency, Workload Management 

Performance and concurrency go hand in hand. It’s great if you send a single complex query through a system and it runs in seconds – you often hear shiny statements like “this query ran in 5 seconds, 5 gazillion times faster than on the previous system,” where the previous system could be my two-year-old laptop or an abacus for all you know (an in-memory DB with a four-letter name comes to mind). But if a DBMS cannot effectively manage the execution of concurrent requests to make the most of its resources, performance can hit rock bottom.

Parallel Data Warehouse uses a fairly basic workload management scheme. A rack of PDW has 32 concurrency slots, or shares of CPU and memory. By default, an incoming query gets one of these – a 1/32 share of resources. After V1, PDW development must have realized that more complex queries might require more than this 1/32 share, and created resource classes (small, medium, large, and extra large) that you could map queries to in order to give them a larger share of resources. PureData System for Analytics provides a number of approaches to managing workloads efficiently. Two of these are Guaranteed Resource Allocation (GRA) and Short Query Bias (SQB). GRA assigns a minimum and maximum share of resources to groups of queries, so that while a set of queries is guaranteed a minimum share, it can use up to its maximum if resources are available. SQB automatically lets queries expected to have a short run time execute even when larger queries are in progress.
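To make the difference between the two schemes concrete, here is a minimal sketch – not actual PDW or PureData code, and the specific slot costs and share percentages are hypothetical, chosen only to mirror the behavior described above: fixed-cost concurrency slots on the PDW side, and guaranteed-minimum/capped-maximum shares on the GRA side.

```python
# Illustrative model only; slot costs and shares below are assumptions.

TOTAL_SLOTS = 32  # a PDW rack exposes 32 concurrency slots


# PDW-style scheme: a query's resource class fixes its slot cost.
# (Costs here are hypothetical, not PDW's actual class sizes.)
SLOT_COST = {"default": 1, "medium": 4, "large": 8, "xlarge": 16}


def pdw_admit(queued_classes):
    """Admit queries in arrival order while slots remain; others wait."""
    free = TOTAL_SLOTS
    running, waiting = [], []
    for cls in queued_classes:
        cost = SLOT_COST[cls]
        if cost <= free:
            running.append(cls)
            free -= cost
        else:
            waiting.append(cls)
    return running, waiting


# GRA-style scheme: each group is guaranteed a minimum share of total
# resources and may grow to its maximum when spare capacity exists.
def gra_allocate(groups):
    """groups: {name: (min_share, max_share, is_busy)} -> {name: share}."""
    alloc = {g: mn if busy else 0.0 for g, (mn, mx, busy) in groups.items()}
    spare = 1.0 - sum(alloc.values())
    for g, (mn, mx, busy) in groups.items():
        if busy and spare > 0:
            extra = min(mx - alloc[g], spare)  # grow toward the cap
            alloc[g] += extra
            spare -= extra
    return alloc
```

The sketch shows the key contrast: in the slot model, a big query that doesn’t fit simply waits even if it only slightly exceeds the free slots, while under GRA-style allocation a busy group never drops below its guaranteed minimum and soaks up idle capacity up to its cap.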

Though I do not have detailed comparisons to share externally at this point, I will say that from accounts I have heard of head to head situations, PDW has struggled with a high number of concurrent queries running across a large data set compared to PDA/Netezza. Don’t take my word for it, put it to the test!

