intro
If you know Excel you can easily start learning SQL.
SQL is the programming language used to explore, analyze, and extract data from relational databases, which are common data storage mechanisms. We also know that many individuals who want to break into a data career are often unsure of how to start learning SQL, as they have never touched a relational database. Fortunately, many of these same individuals have used Excel (or similar spreadsheet software) and understand some of the fundamentals of working with data. Fortunately, conducting similar data tasks in SQL is fairly straightforward and the examples below will show you how.
All materials can be downloaded here. These examples use the New York City Airbnb Open Data, which is directly downloadable from Kaggle.
For more information on establishing a connection to a SQLite database in DbVisualizer, see our previous post Why Can’t We Just Have a Single Table? Navigating Relationships in Relational Databases.
View Your Data
Excel
In Excel we can simply open a file and the data will appear.
Excel
In SQL we use SELECT statements to output data. The example below uses SELECT * FROM AB_NYC_2019 which is an instruction to select (SELECT) all columns (*) from (FROM) the table named AB_NYC_2019. The media below shows the execution of this command in DbVisualizer.
Filter
Excel
Filtering on one or more values in a column can be accomplished with the Sort & Filter tool in the Home menu.
SQL
The WHERE clause is used to filter on values in SQL. Note that the WHERE clause must be written after the initial SELECT statement. The code below selects all columns from AB_NYC_2019 where the neighborhood_group column has a value of “Brooklyn”. Be aware that column values which are text rather than numbers, as is the case with values in neighborhood_group, must be contained between quotation marks.
1
SELECT * FROM AB_NYC_2019
2
WHERE neighbourhood_group = "Brooklyn";
SQL
Mutliple filter conditions can be used in a WHERE clause with the OR operator.
1
SELECT * FROM AB_NYC_2019
2
WHERE neighbourhood_group = "Brooklyn"
3
OR neighbourhood_group = "Manhattan";
However, if we wanted to filter to all records except those records in Brooklyn it is more efficient to use the <> operator, which means “not equal”.
1
SELECT * FROM AB_NYC_2019
2
WHERE neighbourhood_group <> "Brooklyn",;
!= also works as a not equal operator in SQL.
1
SELECT * FROM AB_NYC_2019
2
WHERE neighbourhood_group != "Brooklyn";
Basic Aggregate Functions
The syntax for basic aggregate functions is quite similar between Excel and SQL.
Excel | SQL | |
---|---|---|
Mean | AVERAGE() | AVG() |
Sum | Sum() | Sum() |
Min | Min() | Min() |
Max | Max() | Max() |
Count | Count() | Count() |
For instance, if we want the obtain the mean for a given column in Excel, say price in column J, we write the following formula in an empty cell: =AVERAGE(J:J), which returns a value of 152.720687.
In SQL, running SELECT AVG(price) FROM AB_NYC_2019 also returns a value of 152.720687 (when rounded).
Aggregate Functions by Group
The previous section showed how to use basic aggregate functions to return statistical measures on the entire dataset. However, there are cases where we will want to conduct an aggregate function like the mean within specific groups in the data. In the current example, let’s say that we want to see the mean price by neighborhood groups.
Excel
A straightforward way to conduct aggregations by groups in Excel is to use pivot tables. We can set up a pivot table in a new sheet by navigating to the Insert menu and clicking on Insert Pivot Table and From Table/Range. By default, the entire sheet should be selected.
The screen below shows how to create a table of the mean prices by neighbourhood group once the pivot table opens in a new sheet.
SQL
We are going to add some more code to our previous SQL query, SELECT AVG(price) FROM AB_NYC_2019. First, let’s add neighbourhood_group as an additional column to select, and be sure to separate selected columns with commas. Now, if you execute SELECT neighbourhood_group, AVG(price) FROM AB_NYC_2019 the command will run without errors, but the output will not be the desired output.
Why is there only one neighbourhood group value? In short, because the SQL query above provides instructions to take the mean of price across the entire table, which produces a single value. As such, the resulting output can only have a single row, so the neighbourhood_group value for the first record in the data table is returned in the neighbourhood_group column in the output.
Is there a solution to obtain the desired output? Yes! We need to add a GROUP BY statement after the SELECT statement, which provides an instruction to perform any aggregations by values of the column following GROUP BY. In our example, the code is as follows.
1
SELECT neighbourhood_group, AVG(price) FROM AB_NYC_2019
2
GROUP BY neighbourhood_group,;
You may notice that the resulting output does not include the Grand Total row that was output by default in the Excel pivot table. However, the mean price for each neighbourhood group is returned with the SQL query above.
Wrapping Up
With the examples above, you now have a reference for conducting basic data analytic tasks in both Excel and SQL. So go on and try these examples out on your own.