About a year ago I wrote a post looking at Google BigQuery which finished on a bum note when I ran into a limitation with the size of tables that could be used in a join. Recently I found out that that particular limitation had been lifted with the introduction of the new JOIN EACH operator, and since my account was still active and the data was still uploaded, I thought I’d see if my query could be made to run.
Just as a bit of background (if you can’t be bothered to read my previous post): the query runs on a sample data set originally created by Vertica (see here) that consists of a 1.2GB csv file with two integer columns and 86,220,856 rows. The rows represent the edges in a social network, and the aim of the query is to find the number of triangles – the number of cases where person A is connected to person B, B is connected to C, and A is also connected to C. The query joins this large table to itself not once, but twice.
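For illustration, here’s the triangle condition sketched in Python on a tiny, made-up edge list (the edges, node numbers and helper names are my own inventions, not anything from the Vertica data set, and the graph is treated as undirected for simplicity):

```python
from itertools import combinations

# A tiny made-up social network: five directed edges over four people.
edges = {(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)}

def connected(a, b):
    """True if a and b are linked in either direction."""
    return (a, b) in edges or (b, a) in edges

nodes = sorted({v for e in edges for v in e})

# Count triples a < b < c where all three pairwise links exist.
# The a < b < c ordering plays the same role as the source < source
# predicates in the SQL: each triangle is counted exactly once.
triangles = sum(
    1
    for a, b, c in combinations(nodes, 3)
    if connected(a, b) and connected(b, c) and connected(a, c)
)
print(triangles)  # → 2, the triangles (1,2,3) and (2,3,4)
```

This is just the brute-force O(n³) statement of the problem; the point of the BigQuery exercise is doing the same thing via self-joins at 86-million-row scale.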
Anyway, to cut a long story short the query did work this time. Here’s the new version, rewritten to use JOIN EACH:
select count(*)
from (
    select e2.Source as e2Source, e2.Destination as e2Destination,
           e3.Source as e3Source, e3.Destination as e3Destination
    from (select * from [Edges.EdgesFull]) as e2
    join each (select * from [Edges.EdgesFull]) as e3
      on e2.destination = e3.source
    where e2.source < e3.source
) as e4
join each (select * from [Edges.EdgesFull]) as e1
  on e1.destination = e4.e2source
  and e4.e3destination = e1.source
where e1.source < e4.e2source
Here’s the evidence:
It took 98.7 seconds to run, which I think is pretty good.
That’s all. My mind is now at rest, and I’ll go back to being a Microsoft fanboy tomorrow.