To join data together you need the data to be colocated - this idea is to colocate a subset of joined data on each node so that the data can be split up
If you've got lots of data - more than can fit in memory you need a way to split the data up between machines. This idea is to prejoin data at insert time and distribute the records that join together to the same machine
this way the overall dataset can be split between machines. But the join can be executed on every node and then aggregated for the total joined data.
In more detail, we decide which machine stores the data by consistently hashing it with the join keys. So the matching records always get stored on the same machine.