Recommended Pre-Reading
- A historical perspective on enterprise data management: a great way to learn about the evolution of enterprise data management and how it has shaped the relational database management systems we use today.
- Creating SQLite Databases from CSV Files: a 30-minute, step-by-step guide to creating SQLite databases from CSV files.
- Introduction to SQL Queries w/ Pandas: a Jupyter Notebook that highlights the 15 most common SQL queries and how to perform them in Pandas.
Introduction
Much of the data we work with in the financial industry is stored in databases and accessed using SQL. To really unlock the power of that data, familiarity with SQL will be a huge advantage in your career, which is why it is often a requirement for data-related roles. This module covers the theoretical and practical aspects of database management and the SQL querying language. If you do not have a database connection set up (as is expected for most learners and beginners), the best alternative is SQLite, a lightweight database engine that can be used without a server. Follow the instructions in the Recommended Pre-Reading section above to set up a SQLite database from a CSV file. If you're following this module as part of a course, you will be required to complete that setup to be able to follow along with the lessons.

Grading Rubric
- Convert a CSV to a SQLite Database, 4 points
- SQL Queries in Practice, 12 points
  - Renaming columns with `ALTER TABLE`, 1 point
  - `SELECT`, `LIMIT` and `OFFSET` in queries, 1 point
  - Returning unique values with `DISTINCT`, 1 point
  - Filtering data with `WHERE`, 1 point
  - Pattern matching with `LIKE`, 1 point
  - Joining tables with `LEFT JOIN`, 1 point
  - Aggregation and grouping with `GROUP BY`, 1 point
  - Using `CAST AS` to change data types, 1 point
  - Using `HAVING` to filter grouped data, 1 point
  - Using a subquery to nest a `SELECT` under another `SELECT`, 1 point
  - Using `BETWEEN` to filter data within a range, 1 point
  - Using `CREATE VIEW` to pack a query into a named object, 1 point
Connecting to a Database
Very often, to make a connection to a database, you will need to provide a connection string and use a library to help with the connection. Among the most popular choices for Python users is `sqlalchemy` (see the additional SQLAlchemy lessons), which acts as an Object Relational Mapper (ORM), but there are also libraries specific to each database:
- PostgreSQL: `psycopg2`
- MySQL: `mysql-connector-python` or `mysqlclient`
Database queries with Pandas
SQL stands for Structured Query Language; it is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS).

Notebook Exercise
Looking for some structured practice with using SQL directly in Pandas? Try your hand at the exercises in 4_SQL_Query before moving on to the lab exercises below.
Lab
All of the following exercises are to be performed on the SQLite database you created from the CSV files:
- `metadata` table: from `idx_subsector_metadata_080824.csv`
- `reports` table: from `idx_sector_reports_080824.csv`
- the database itself should be called `sectors.db`
Renaming columns with ALTER TABLE and RENAME COLUMN
Let's start off simple. Create a file and name it 01_rename.sql. Use `ALTER TABLE <table_name> RENAME COLUMN <existing_name> TO <new_name>` to rename the `sub_sector_id` column in the `metadata` table to `sub_sector_uid`.
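A minimal sketch of what 01_rename.sql could contain, assuming your SQLite version supports RENAME COLUMN (3.25.0 or newer):

```sql
-- 01_rename.sql
-- Rename sub_sector_id to sub_sector_uid in the metadata table
ALTER TABLE metadata
RENAME COLUMN sub_sector_id TO sub_sector_uid;
```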
SELECT, LIMIT and OFFSET in queries
Now create a second file and call it 02_select.sql.
Use SELECT to query the reports table and return the first 10 rows with the following columns:
- `sub_sector`
- `total_companies`
- `filtered_median_pe`

Then add an OFFSET clause to your query to return the first 10 rows after the first 5 rows (skipping the first 5 rows).
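One possible shape for 02_select.sql; the second query simply appends OFFSET to skip the first 5 rows:

```sql
-- 02_select.sql
-- First 10 rows of the three columns
SELECT sub_sector, total_companies, filtered_median_pe
FROM reports
LIMIT 10;

-- First 10 rows after skipping the first 5
SELECT sub_sector, total_companies, filtered_median_pe
FROM reports
LIMIT 10 OFFSET 5;
```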
Return unique values with DISTINCT and AS aliasing
Create a third file and name it 03_distinct.sql.
Use SELECT to query the metadata table and return the first 10 unique sectors in the table.
Add a DISTINCT clause to your query to return only unique values in the
sector column. This will remove any duplicate values from the result set.
Name the column Indonesian Sectors using the AS keyword.
Finally, add a LIMIT clause to return only the first 10 rows.
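A sketch of 03_distinct.sql, aliasing the column with AS:

```sql
-- 03_distinct.sql
-- Unique sector values, aliased, limited to 10 rows
SELECT DISTINCT sector AS "Indonesian Sectors"
FROM metadata
LIMIT 10;
```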
Filtering data with WHERE
Create a fourth file and name it 04_where.sql.
From the reports table, select only rows where the sub-sector has at least 15 companies and the average market cap of that sector (`avg_market_cap`) is less than 500,000,000,000 (500 billion).
Return the following columns:
- `sub_sector`
- `total_companies`
- `avg_market_cap`

Combine the two conditions with the AND keyword.
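A sketch of 04_where.sql. If your CSV import stored the numeric columns as text, you may need to cast them (covered in the CAST AS exercise below) for the comparisons to behave as expected:

```sql
-- 04_where.sql
-- Sub-sectors with at least 15 companies and an average market cap under 500 billion
SELECT sub_sector, total_companies, avg_market_cap
FROM reports
WHERE total_companies >= 15
  AND avg_market_cap < 500000000000;
```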
Pattern matching with LIKE
Create a fifth file and name it 05_like.sql.
In SQL, you can use wildcards for constructing patterns.
- The `%` wildcard matches any sequence of zero or more characters.
- The `_` wildcard matches any single character.

For example:
- `b%` matches `bank`, `bytes`, `bit`
- `%ion` matches `institution`, `renovation`
- `%it%` matches any string containing `it`
- `b_nk` matches `bank`, `bunk`
- `l__n` matches `loan`, `lean`

By default, LIKE is case insensitive in SQLite; you can make it case sensitive with `PRAGMA case_sensitive_like = true;`.
Select all rows from the metadata table where the sub_sector column contains the word Goods or Services (both case insensitive),
regardless of the position of these words in the string.
Return all columns.
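A sketch of 05_like.sql. Because LIKE is case insensitive by default in SQLite, lowercase patterns still match "Goods" and "Services":

```sql
-- 05_like.sql
-- Rows whose sub_sector contains "Goods" or "Services" anywhere in the string
SELECT *
FROM metadata
WHERE sub_sector LIKE '%goods%'
   OR sub_sector LIKE '%services%';
```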
Joining tables with LEFT JOIN
Create a sixth file and name it 06_left_join.sql.
Use a LEFT JOIN to join the metadata and reports tables on the sub_sector column.
Return the following columns:
- `sector`
- `sub_sector`
- `total_companies`
- `avg_market_cap`
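A sketch of 06_left_join.sql, assuming sub_sector is the join key present in both tables:

```sql
-- 06_left_join.sql
-- Keep every row of metadata; pull in report figures where a match exists
SELECT m.sector, m.sub_sector, r.total_companies, r.avg_market_cap
FROM metadata AS m
LEFT JOIN reports AS r
  ON m.sub_sector = r.sub_sector;
```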
Aggregation and grouping with GROUP BY
Create a seventh file and name it 07_group_by.sql.
This file should contain two queries.
For the first query, use COUNT to return the number of rows in each group
of sector in the metadata table. This gives us an idea of how many sub sectors
are in each sector.
Name the columns sector and total_sub_sectors.
For the second query, use AVG to return the average number of companies of each sub-sector by grouping on the sector column.
Name the columns sector, sub_sector, total_companies, and avg_sub_sector_companies.
This query should be similar to the one in your LEFT JOIN exercise
since you will need to join the two tables to get the total_companies column.
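A sketch of the two queries in 07_group_by.sql; adjust the selected columns to match the names the exercise asks for:

```sql
-- 07_group_by.sql
-- Query 1: how many sub-sectors sit in each sector
SELECT sector, COUNT(*) AS total_sub_sectors
FROM metadata
GROUP BY sector;

-- Query 2: average company count per sector, reusing the join from 06_left_join.sql
SELECT m.sector, AVG(r.total_companies) AS avg_sub_sector_companies
FROM metadata AS m
LEFT JOIN reports AS r
  ON m.sub_sector = r.sub_sector
GROUP BY m.sector;
```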
Using CAST AS and ORDER BY
Since aggregation is commonly applied to numeric columns, this can present a bit of a problem when our columns are stored as character strings. You may alter the database table and change the column types, or you may use CAST AS to cast the column values from character to either integer or decimal.
Create an eighth file and name it 08_cast_as.sql.
Use CAST AS to change the total_companies and avg_market_cap columns in the reports table to an integer.
Finally, use ORDER BY to sort the result set by the total_companies column in descending order. Return
only the first ten rows and the following columns:
- `sub_sector`
- `total_companies` as `companies_num`
- `avg_market_cap` as `market_cap_average`
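A sketch of 08_cast_as.sql; SQLite lets you reference the alias in ORDER BY:

```sql
-- 08_cast_as.sql
-- Cast text columns to integers, then sort by company count, largest first
SELECT
  sub_sector,
  CAST(total_companies AS INTEGER) AS companies_num,
  CAST(avg_market_cap AS INTEGER) AS market_cap_average
FROM reports
ORDER BY companies_num DESC
LIMIT 10;
```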
Using HAVING to filter grouped data
HAVING has to be used with GROUP BY. It acts similarly to how WHERE works on rows, except HAVING filters at the group level.
Create a ninth file and name it 09_having.sql.
Perform the second exercise in the seventh file again, but this
time return only sectors where the avg_sub_sector_companies is greater than 30.
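A sketch of 09_having.sql, repeating the aggregate expression inside HAVING so the query stays portable across databases:

```sql
-- 09_having.sql
-- Same grouping as 07_group_by.sql, keeping only sectors averaging more than 30 companies
SELECT m.sector, AVG(r.total_companies) AS avg_sub_sector_companies
FROM metadata AS m
LEFT JOIN reports AS r
  ON m.sub_sector = r.sub_sector
GROUP BY m.sector
HAVING AVG(r.total_companies) > 30;
```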
Use a subquery to nest a SELECT under another SELECT
Subqueries allow you to put together more complex queries while maintaining some readability. Fundamentally, a subquery is just a nested SELECT statement.
Create a tenth file and name it 10_subquery.sql.
First create a subquery that returns all rows from metadata where the sector is either Financials or Technology (case-sensitive!).
Then use this subquery to filter the reports table, so the returned set
only contains rows where the sub_sector is in the (earlier) subquery result.
Return the following columns in your query:
- `sub_sector`
- `min_company_pe`
- `max_company_pe`
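A sketch of 10_subquery.sql, using IN with a nested SELECT:

```sql
-- 10_subquery.sql
-- The outer query filters reports by the sub-sectors returned from the inner query
SELECT sub_sector, min_company_pe, max_company_pe
FROM reports
WHERE sub_sector IN (
    SELECT sub_sector
    FROM metadata
    WHERE sector IN ('Financials', 'Technology')
);
```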
Using BETWEEN to filter data within a range
Create an eleventh file and name it 11_between.sql.
Your query should be similar to the one in the tenth file above,
with a small adjustment.
- In your subquery (the inner query), return all rows from `metadata` where the `sector` is either `Financials` or `Infrastructures` and the `sub_sector_uid` is between 3 and 20.
- In your outer query, return only rows where the `avg_yoy_q_revenue_growth` is between 0 and 0.3.
BETWEEN is inclusive on both ends, so BETWEEN 0 AND 17 includes 0 and 17.
Return the following columns:
- `sub_sector`
- `avg_yoy_q_revenue_growth`
- `min_company_pe`
- `max_company_pe`
- If you have performed the earlier exercises, your column names may have been altered. We're working with the `sub_sector_uid` column after the renaming exercise.
- BETWEEN, as you'd expect, expects two values separated by AND, and these should be numbers or dates. You might have to cast the column to a number if it isn't already.
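A sketch of 11_between.sql; the casts are only needed if the columns were imported as text:

```sql
-- 11_between.sql
SELECT sub_sector, avg_yoy_q_revenue_growth, min_company_pe, max_company_pe
FROM reports
WHERE sub_sector IN (
    SELECT sub_sector
    FROM metadata
    WHERE sector IN ('Financials', 'Infrastructures')
      AND CAST(sub_sector_uid AS INTEGER) BETWEEN 3 AND 20
)
  AND CAST(avg_yoy_q_revenue_growth AS REAL) BETWEEN 0 AND 0.3;
```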
Using CREATE VIEW to pack a query into a named object
For the final exercise, create a file and name it 12_create_view.sql.
A view is the result set of a stored query. It is a virtual table you define
in the database that can be queried like a table, except that it
does not store data.
Once you have defined a view, you can query it like a table — it can include
columns from any number of tables, involving complex joins, filters and
aggregations.
We create a view with the CREATE VIEW statement.
Create a view named `watchlist` from `reports` where the following conditions are met:
- `avg_market_cap` is greater than the average of `avg_market_cap`. This means we're only watching sectors with companies that are generally larger than the average.
- `avg_yoy_q_revenue_growth` is greater than 0.1. This narrows our watchlist to sectors with companies that are growing at a rate of at least 10% year-on-year.
- `median_yield_ttm` is greater than 0.03. This further narrows our watchlist to sectors with companies that are, on average, paying out at least a 3% dividend yield.
Then use a PRAGMA (for example, `PRAGMA table_info(watchlist);`) to print out the columns of the `watchlist` view, as sketched below.
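A sketch of 12_create_view.sql; the average market cap is computed with a scalar subquery:

```sql
-- 12_create_view.sql
CREATE VIEW watchlist AS
SELECT *
FROM reports
WHERE avg_market_cap > (SELECT AVG(avg_market_cap) FROM reports)
  AND avg_yoy_q_revenue_growth > 0.1
  AND median_yield_ttm > 0.03;

-- Print the columns of the view
PRAGMA table_info(watchlist);
```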
Benefits and Advantages of Views
Views present the following benefits:
- Simplify complex queries: Views can encapsulate complex queries and present a simplified, more understandable interface. Users can query the view without needing to know the details of the underlying tables and joins.
- Data security: Views can restrict access to sensitive data. By creating views that only include certain columns or rows, you can control what data users see and access.
- Consistency: Views ensure consistent results for complex queries. If you have a commonly used query, encapsulating it in a view ensures that any changes to the query logic are applied uniformly across all users and applications accessing that view.
- Abstraction: Views abstract the underlying database schema. If the schema changes, you can often adjust the view without modifying the application code that uses the view.
- Maintainability: If you need to update how data is presented or aggregated, you can modify the view definition without changing the application code that uses it.
- Materialized views: Some databases support materialized views, which store the results of the view query in a table-like structure. This can improve query performance by pre-computing the results and storing them for faster access. In cases where materialized views aren't supported, you can create a table from the view. It is important to distinguish this from regular views, which are virtual and are generally just queries that are run when the view is queried.

