About a year ago I wrote a post looking at Google BigQuery which finished on a bum note when I ran into a limitation with the size of tables that could be used in a join. Recently I found out that that particular limitation had been lifted with the introduction of the new JOIN EACH operator, and since my account was still active and the data was still uploaded, I thought I’d see if my query could be made to run.

Just as a bit of background (if you can’t be bothered to read my previous post): the query runs on a sample data set originally created by Vertica (see here) that consists of a 1.2GB csv file with two integer columns and 86,220,856 rows. The rows represent the edges in a social network, and the aim of the query is to find the number of triangles – the number of cases where person A is connected to person B, B is connected to C and A is also connected to C. The query joins this large table to itself not once, but twice.

Anyway, to cut a long story short the query did work this time. Here’s the new version, rewritten to use JOIN EACH:

select count(*)
from
(select
e2.Source as e2Source, e2.Destination as e2Destination,
e3.Source as e3Source, e3.Destination as e3Destination
from
(select * from [Edges.EdgesFull]) as e2
join each
(select * from [Edges.EdgesFull]) as e3
on e2.destination = e3.source
where e2.source < e3.source) as e4
join each
(select * from [Edges.EdgesFull]) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source

Here’s the evidence:

image

It took 98.7 seconds to run, which I think is pretty good. Some remarks:

  • I’m not 100% sure this is the correct result, but I’m fairly confident.
  • I’m a long way from being 100% sure this is the most efficient way to write the query.
  • When the original Vertica blog post came out it sparked a p*ssing contest between vendors who wanted to to show they could do better than Vertica. You can see an example of other people’s results here; clearly other platforms could do a lot better, and remember these results are two years old.
  • Obviously you can’t judge the performance of a database on one query.

That’s all. My mind is now at rest, and I’ll go back to being a Microsoft fanboy tomorrow.