Posted by Felipe Hoffa, GCP Developer Advocate
Exploring hidden trends and relationships in Stack Overflow data is a good lesson in doing SQL analytics with BigQuery.
Great news: we’ve just added Stack Overflow's history of questions and answers to the collection of public datasets on BigQuery. This means that anyone with a Google Cloud Platform account can use SQL queries (or another favorite tool) to dig into this treasure trove of data.
You can find some sample queries on the Stack Overflow Data documentation page, for example:
- "What percentage of questions have been answered over the years?"
- "What is the reputation and badge count of users across different tenures on Stack Overflow?"
- "What are the 10 ‘easiest’ gold badges to earn?"
- "Which day of the week has most questions answered within an hour?"
Take these questions as a starting point, then feel free to share your results and query variations with us via reddit.com/r/bigquery. And if you have any questions, ask the community on Stack Overflow.
Diving into the data
You might be wondering: What's so special about querying Stack Overflow with BigQuery? After all, Stack Overflow already refers users to Stack Exchange Data Explorer (SEDE), a data-focused site where users have shared and prioritized thousands of questions, and that works really well. So, let's review some of the advantages of having Stack Overflow data in BigQuery too:
- Surpass the 50,000 row limit. SEDE can only output up to 50,000 rows. This is not a problem for BigQuery.
- Robots welcome. SEDE protects itself from abuse with CAPTCHAs, and has no API. With BigQuery no CAPTCHAs are needed to log in, and its REST API allows a variety of tools to leverage its power. Feel free to connect Tableau, re:dash, Looker, R, pandas, and your favorite tools to it.
- JOIN everything. There are plenty of other datasets shared on BigQuery, and there’s nothing stopping you from loading even more, privately or for public consumption. Imagine the questions you could answer by querying across them.
Let’s look at an example of joining. We have terabytes of GitHub's open source code shared on BigQuery. Let’s find out which Stack Overflow questions are referenced most often in the GitHub code, specifically in JavaScript files.
#standardSQL
SELECT a.id, title, c files, answer_count answers, favorite_count favs,
view_count views, score
FROM `bigquery-public-data.stackoverflow.posts_questions` a
JOIN (
SELECT CAST(REGEXP_EXTRACT(content,
r'stackoverflow.com/questions/([0-9]+)/') AS INT64) id, COUNT(*) c,
MIN(sample_path) sample_path
FROM `fh-bigquery.github_extracts.contents_js`
WHERE content LIKE '%stackoverflow.com/questions/%'
GROUP BY 1
HAVING id>0
ORDER BY 2 DESC
LIMIT 10
) b
ON a.id=b.id
ORDER BY c DESC
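To make the subquery's logic concrete, here is a minimal Python sketch of the same extract-and-count step, run on a few made-up strings rather than real GitHub data. The sample contents and the helper name `count_question_refs` are hypothetical; `re.search` stands in for `REGEXP_EXTRACT`, which likewise returns only the first match in each row.

```python
import re
from collections import Counter

# Same pattern as the query above; re.search keeps only the first match per string.
QUESTION_ID = re.compile(r'stackoverflow.com/questions/([0-9]+)/')

def count_question_refs(contents, limit=10):
    """Count files per referenced question id, most referenced first."""
    counts = Counter()
    for content in contents:
        match = QUESTION_ID.search(content)
        if match is None:
            continue  # no question link found, so this file contributes nothing
        question_id = int(match.group(1))  # mirrors CAST(... AS INT64)
        if question_id > 0:  # mirrors HAVING id>0
            counts[question_id] += 1
    return counts.most_common(limit)  # mirrors ORDER BY 2 DESC LIMIT 10

# Hypothetical file contents, for illustration only.
sample_files = [
    "// see http://stackoverflow.com/questions/111/some-title",
    "// from https://stackoverflow.com/questions/111/some-title and more",
    "/* stackoverflow.com/questions/222/other */",
    "var x = 1; // no link here",
]
print(count_question_refs(sample_files))
```

Because only the first link in each file is extracted, a file that cites several questions is counted once, for the first question it mentions.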
Here are the most referenced Stack Overflow questions within JavaScript code on GitHub:
Or, we can look at GitHub pull-request comments from GHTorrent (also on BigQuery):
#standardSQL
SELECT a.id, title, c files, answer_count answers, favorite_count favs,
view_count views, score
FROM `bigquery-public-data.stackoverflow.posts_questions` a
JOIN (
SELECT CAST(REGEXP_EXTRACT(body,
r'stackoverflow.com/questions/([0-9]*)/') AS INT64) id, COUNT(*) c
FROM `ghtorrent-bq.ght.pull_request_comments`
WHERE body LIKE '%stackoverflow.com/questions%'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
) b
ON a.id=b.id
ORDER BY c DESC
Here are the results:
Or, let's look at Hacker News. What are the most popular tags of questions that have been posted there since 2014?
#standardSQL
SELECT tag, SUM(c) c
FROM (
SELECT CONCAT('stackoverflow.com/questions/', CAST(b.id AS STRING)),
title, c, answer_count, favorite_count, view_count, score, SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
JOIN (
SELECT CAST(REGEXP_EXTRACT(text,
r'stackoverflow.com/questions/([0-9]+)/') AS INT64) id, COUNT(*) c
FROM `fh-bigquery.hackernews.comments`
WHERE text LIKE '%stackoverflow.com/questions/%'
AND EXTRACT(YEAR FROM time_ts)>=2014
GROUP BY 1
ORDER BY 2 DESC
) b
ON a.id=b.id),
UNNEST(tags) tag
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
Here are the most popular tags on Stack Overflow questions linked from Hacker News since 2014:
How does that compare to the rest of Stack Overflow?
#standardSQL
SELECT tag, COUNT(*) c
FROM (
SELECT SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
WHERE EXTRACT(YEAR FROM creation_date)>=2014
), UNNEST(tags) tag
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
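The SPLIT and UNNEST steps in this query can be sketched in plain Python. This is only an illustration under assumed inputs: the sample tag strings and the helper name `top_tags` are hypothetical, and the pipe-delimited format is taken from the `SPLIT(tags, '|')` call above.

```python
from collections import Counter

def top_tags(tag_strings, limit=10):
    """Split each pipe-delimited tag string and count every tag occurrence."""
    counts = Counter()
    for tags in tag_strings:
        # SPLIT + UNNEST: each question contributes one count per tag it carries.
        counts.update(tags.split('|'))
    return counts.most_common(limit)  # mirrors ORDER BY 2 DESC LIMIT 10

# Hypothetical tag strings, one per question, for illustration only.
sample_tags = [
    "javascript|jquery",
    "python",
    "javascript|css",
    "python|pandas|javascript",
]
print(top_tags(sample_tags, limit=2))
```

A question with three tags adds to three different totals, so the counts sum to more than the number of questions.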
It would seem that the Hacker News community cares a lot more about Haskell, C, C++, and performance than Stack Overflow as a whole, which lists php, android, jquery, and css within its most popular tags:
Next steps
If you haven't tried BigQuery yet, follow this Beginner’s Tutorial, which shows how to analyze 50 billion page views in 5 seconds. Then feel free to experiment with any other query or dataset you like: for example, our official public BigQuery datasets, datasets that other users have shared, and of course your very own.