Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 39 additions & 5 deletions axiom/optimizer/Schema.h
Original file line number Diff line number Diff line change
Expand Up @@ -128,13 +128,42 @@ class Locus {

using LocusCP = const Locus*;

/// A Distribution describes how data is divided between files for
/// data at rest and workers for data in flight based on values of
/// special columns. We use the word partitioning to mean that where
/// a row of data at rest or in flight is found depends on a
/// function of one or more columns of the row. Hash partitioning nd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: partitioning nd -> partitioning and

/// range partitioning are examples. The distribution can also state
/// that rows within a partition are sorted on some columns. The
/// rows of a partition are the rows of a dataset for which the
/// partitioning function satisfy some criteria, e.g. has % number
/// of partitions == id of partition. The corresponding concept in
/// Hive, e.g. Presto or Spark is bucketing. the word partitioning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: the word -> The word

/// in Hive means something else. When Verax uses partitioning the
/// Hive translation is bucketing. Number of partitions means how many ways the
/// dataset is split by hash or range of partitioning keys. The Hive word for
/// this is count of buckets. Hive partitioning columns, i.e. columns whose
/// value directly specifies a file system directory where the file is found,
/// are largely transparent to Verax.

/// Two datasets are copartitioned if for one partition of one set
/// we know that a join on partitioning keys can only fall in a
/// limited set of partitions of the other dataset. This means that
/// the partitioning function is the same and the count of
/// partitions on either side are the same or one is an integer
/// multiple of the other.

/// A DistributionType specifies the partitioning function and the number of
/// partitions. Copartitioning exists if datasets are joined on partitioning
/// columns and the DistributionTypes are compatible.

/// Method for determining a partition given an ordered list of partitioning
/// keys. Hive hash is an example, range partitioning is another. Add values
/// here for more types.
/// keys. Hive hash bucketing is an example, range partitioning is another. Add
/// values here for more types.
enum class ShuffleMode { kNone, kHive };

/// Distribution of data. 'numPartitions' is 1 if the data is not partitioned.
/// There is copartitioning if the DistributionType is the same on both sides
/// There is copartitioning if the DistributionType iscompatible on both sides
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: iscompatible -> is compatible

Should there be a function that takes 2 DistributionType's and returns true if they are compatible?

/// and both sides have an equal number of 1:1 type matched partitioning keys.
struct DistributionType {
bool operator==(const DistributionType& other) const {
Expand All @@ -148,8 +177,8 @@ struct DistributionType {
bool isGather{false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you document locus and isGather properties? Why do we need 'locus'?

};

// Describes output of relational operator. If base table, cardinality is
// after filtering.
// Describes data at rest or output of relational operator. If base table,
// cardinality is after filtering.
struct Distribution {
Distribution() = default;
Distribution(
Expand Down Expand Up @@ -206,6 +235,11 @@ struct Distribution {
/// True if 'other' has the same ordering columns and order type.
bool isSameOrder(const Distribution& other) const;

/// makes a new Distribution where 'exprs' are replaced with the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: makes -> Makes

/// corresponding element of 'names'. Expressions in 'this' but not
/// in 'exprs' are not in the result. Recomputes partitioning and
/// ordering properties. These can be lost if partitioning or
/// ordering columns are not copied.
Distribution rename(const ExprVector& exprs, const ColumnVector& names) const;

std::string toString() const;
Expand Down
2 changes: 1 addition & 1 deletion axiom/optimizer/connectors/ConnectorMetadata.h
Original file line number Diff line number Diff line change
Expand Up @@ -256,7 +256,7 @@ class TableLayout {

/// Set of partitioning columns. The values in partitioning columns determine
/// the location of the row. Joins on equality of partitioning columns are
/// co-located.
/// co-located. In Hive these are called bucketing columns.
const std::vector<const Column*>& partitionColumns() const {
return partitionColumns_;
}
Expand Down
3 changes: 3 additions & 0 deletions axiom/optimizer/connectors/hive/HiveConnectorMetadata.h
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,9 @@ class HiveTableLayout : public TableLayout {
return fileFormat_;
}

/// Hive partitioning columns. partitionColumns() for a Hive layout return the
/// bucketing columns. This returns the partitioning columns, i.e. columns
/// whose values define the file system path of a file.
const std::vector<const Column*>& hivePartitionColumns() const {
return hivePartitionColumns_;
}
Expand Down