You should avoid key style distribution in the following scenarios: You should use this when your table has multiple rows to a single id and you intend to query or join using the dist key at a later date. If there are more dist keys than nodes, then failing any collocation, values with the same will be distributed out to nodes round robin style. If possible, they will go to the same slice that other data from other tables that share the same dist key are located, though that's not always possible. Tables over 10 million rows, you really don't want to have explain to a DBA why a 10m row table has been copied onto every node.Įnter fullscreen mode Exit fullscreen modeĪfter defining a column as the distribution key, Redshift will store rows with the same key on the same node slice.About 500k-1m is the sweet spot for ALL, tables smaller than this will be forcibly re-distributed even. You should avoid all style distribution in the following scenarios: There are four distribution styles you can specify when building a table:Īll copies the entire table to the first slice in every node so in an environment with four nodes, the table would be copied four times and obviously takes up 4x the amount of space, having a copy of the data on every node means faster queries, especially for joins, but has the inverse effect in that loading, updating, deleting or inserting takes longer, and remember Redshift is OLAP so these functions already take longer than an OLTP system.Īll should only be used on a static table that rarely change but is often referenced in a join or a cte as returning results will happen quickly but writing new data will take longer. Running a query that works against a specific tables distribution doesn't necessarily mean the query will be slow, after all you still have a heap of processors crunching data.ĭiagram of a Redshift Cluster (to remind you) However, if you are querying a pre-existing table, it's distribution style may actually work work against the query you are trying to run leaving you with two options (1) change your query to work with how a table is distributed, if possible, or (2) suck it up and deal. So if you are creating tables as part of a piece of analysis, you can have a hand in distributing the table in a way that will aid your future querying efforts. This tells Redshift how to spread your data across it's physical nodes. You achieve this when building tables by assigning a Distribution Style and Key. Rather than index at the beginning of a book outlining what subjects start and end where, instead each subject is in it's own book. Redshift takes a more physical approach to this. Just like an index in a book, if you know what page in a book a certain subject is on, you can jump right to it saving you from scanning through the whole book. In other data warehouses you can speed up table reads by defining an index on a table. As mentioned in my previous two articles, a table may be spread across multiple Compute Nodes and, is in part, one of the things that can result in Redshift returning results lightning quick.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |