Quantcast
Channel: Hacker News
Viewing all articles
Browse latest Browse all 25817

Examining crime reports in San Francisco with SQL

$
0
0

22 Aug 2016

Want to make data analysis fast for everyone?

Join Us!

When I first moved to San Francisco, my mother warned me “not to go looking for trouble”, but this can be difficult when you do not know the city well. Luckily, SF OpenData provides a great public safety dataset, and our good friend Redshift has a few helpful tricks to help make sure trouble doesn’t find us.

To ease my mother’s concerns, we’ll examine all of the crime reports in San Francisco from the start of 2003 to the end of 2015.

We can start by plotting the total monthly crime and overlay trendlines for each year to get a full picture of our data.

Let’s break down the data into three areas: crime by day of the week, hour of the day, and the most criminally active days of the year.

Analyzing Crime by Day of Week and Hour of Day

Redshift’s extract() function allows us to easily retrieve the day of the week and the hour of the day from a timestamp. Applying it to our dataset we can build a query that extracts the dayofweek and hour from the date:

selectextract(dayofweekfromdate),extract(hourfromdate),count(1)fromsf_crime_2003_2015groupby1,2orderby3

Redshift’s dayofweek datepart returns a value from 0 to 6, with 0 representing Sunday and 6 representing Saturday. We can build a table with these mappings, and we can join it to our query above to make the results more readable.

select0,'Sunday'unionselect1,'Monday'unionselect2,'Tuesday'unionselect3,'Wednesday'unionselect4,'Thursday'unionselect5,'Friday'unionselect6,'Saturday'

After converting to readable day of week names, we compile our final result

It looks like 5 and 6 pm are the most crime-ridden parts of the day, with Friday being the most criminally active day. It could also be an effect of how this data is collected - officers may be more likely to mark the time they are filing their report at the end of the day (and end of the week) than the time of a crime’s occurrence.

Most Criminal Day of the Year

Continuing with our use of the extract() function, we can now count the number of reports and group by the extracted day of year. This allows us to determine which days see the most criminal reports. There are two issues that arise when we use this method:

  1. Grouping by dayofyear does not account for leap years
  2. The query will return the groupings based on the numeric day of the year, rather than human readable MM-DD values

We can account for the leap year issue by matching which years are leap years in a case when statement and subtracting a day after we cross Feb. 29th, the 60th day of the year

withcrime_by_doyas(select-- Check to see if it is a leap yearcasewhen(extract(yearfromdate)%4=0andextract(yearfromdate)%100<>0)orextract(yearfromdate)%400=0thencasewhenextract(dayofyearfromdate)=60then'Feb. 29'whenextract(dayofyearfromdate)>60then(extract(dayofyearfromdate)-1)::textelse(extract(dayofyearfromdate))::textendelse-- If it is not a leap year, no conversions are necessary(extract(dayofyearfromdate))::textendasdoy,count(1)asnum_incidentsfromsf_crime_2003_2015groupby1)

Our outermost case when identifies whether or not we are looking at a timestamp from a leap year. When we are, we shift the dayofyear extraction to move all values on February 29th to be their own text string. The remaining extractions ensure we match up the dayofyear values with those of non-leap years.

Now that we have addressed the first issue, we can move on to translating the dayofyear back into a readable format. To do this, we will use Redshift’s to_char() function. We will first need a list of dates from a non-leap year. To help generate that list of dates, we turn to our post on generating series in Redshift.

,generated_seriesas(select(getdate()::date-row_number()over(orderbytrue))::dateasnfromsf_crime_2003_2015limit730-- value for 2 years),doy_mappingas(select(extract(doyFROM(n)::timestamp))::textas"doy",to_char(n,'Month DD')as"mm_dd"fromgenerated_serieswhereextract(yearfromn)=2015)selectcrime_by_doy.doy,"mm_dd"asdate,crime_by_doy.num_incidentsfromcrime_by_doyjoindoy_mappingoncrime_by_doy.doy=doy_mapping.doyorderby3desc

The doy_mapping CTE uses the to_char() function to convert our dates from YYYY-MM-DD into Month DD. In the doy_mapping where clause, we can use any year that we know is not a leap year to ensure we are returning 365 results. Finally, our main query joins our CTEs together to give us an easily readable result set!

Interestingly, it looks like the first day of a month is the prime time for crime! In particular, New Year’s Day makes it seem that some of the New Year’s Resolutions may be less than lawful. To know for sure, we would have to investigate whether this is an artifact of how this data is reported and collected, or if it is a real phenomenon.

On the opposite end of the spectrum, we can order by the count ascending to identify the least criminal days of the year. Here, we see crime reporting drops off sharply with the Holiday season.

We can plot this to see how the number of crimes reported fluctuates over the course of a year.

Crime is fairly consistent over the course of the year. There is a small but notable increase in September-October (when the weather is nicest in San Francisco) before quickly dropping off in the Holiday season.

Looking at different time and date series is a great way to build familiarity with a dataset. We hope extract() helps with your time and date analytics!

Thank you


Viewing all articles
Browse latest Browse all 25817

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>