import pandas as pd
import letsql as ls
= ls.connect()
con
# Load the dataset
= "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
url = pd.read_csv(url)
df
# Register it
= con.register(df, "iris") iris
Tutorial
This is a quick tutorial of some basic commands and usage patterns of LETSQL. We’ll cover selecting data, filtering, grouping, aggregating, and ordering.
Prerequisite
Please start by following the installation instructions, and then install pandas.
Loading the dataset
LETSQL can work with several file types, but at its core is an Ibis backend so it connects to existing databases and interacts with the data there.
We’ll use the iris dataset, which is a classic dataset used in machine learning and statistics. It contains measurements of four features (sepal length, sepal width, petal length, petal width) for 150 flowers from three species (setosa, versicolor, virginica).
You can now see the example dataset copied over to the database
con.list_tables()
['iris']
There’s one table, called iris
. We can interact with it through the object that was returned when registering the table. Or we can use the table method to get it.
= con.table("iris") iris
LETSQL is an Ibis Backend so it is lazily evaluated, so instead of seeing the data, we see the schema of the table, instead. To peek at the data, we can call head
and then execute
to get the first few rows of the table.
iris.head().execute()
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Interactive mode
For the rest of this intro, we’ll turn on interactive mode, which partially executes queries to give users a preview of the results. There is a small difference in the way the output is formatted, but otherwise this is the same as calling to_pandas
on the table expression with a limit of 10 result rows returned.
import ibis
= True
ibis.options.interactive iris.head()
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │ │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │ │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │ │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │ │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
Common operations
LETSQL has a collection of useful table methods to manipulate and query the data in a table (or tables).
select
Selecting data involves choosing specific columns from the table. To select a column you can use the name of the column as a string:
# Select the 'sepal_length' and 'sepal_width' columns
"sepal_length", "sepal_width") iris.select(
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ │ float64 │ float64 │ ├──────────────┼─────────────┤ │ 5.1 │ 3.5 │ │ 4.9 │ 3.0 │ │ 4.7 │ 3.2 │ │ 4.6 │ 3.1 │ │ 5.0 │ 3.6 │ │ 5.4 │ 3.9 │ │ 4.6 │ 3.4 │ │ 5.0 │ 3.4 │ │ 4.4 │ 2.9 │ │ 4.9 │ 3.1 │ │ … │ … │ └──────────────┴─────────────┘
selectors
Typing out ALL of the column names except one is a little annoying. Instead of doing that again, we can use a selector
to quickly select or deselect groups of columns.
import ibis.selectors as s
iris.select(~s.matches("sepal_length")
# match every column except `sepal_length`
)
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ string │ ├─────────────┼──────────────┼─────────────┼─────────┤ │ 3.5 │ 1.4 │ 0.2 │ setosa │ │ 3.0 │ 1.4 │ 0.2 │ setosa │ │ 3.2 │ 1.3 │ 0.2 │ setosa │ │ 3.1 │ 1.5 │ 0.2 │ setosa │ │ 3.6 │ 1.4 │ 0.2 │ setosa │ │ 3.9 │ 1.7 │ 0.4 │ setosa │ │ 3.4 │ 1.4 │ 0.3 │ setosa │ │ 3.4 │ 1.5 │ 0.2 │ setosa │ │ 2.9 │ 1.4 │ 0.2 │ setosa │ │ 3.1 │ 1.5 │ 0.1 │ setosa │ │ … │ … │ … │ … │ └─────────────┴──────────────┴─────────────┴─────────┘
You can also use a selector
alongside a column name.
"species", s.numeric()) iris.select(
┏━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ ┃ species ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ ┡━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ │ string │ float64 │ float64 │ float64 │ float64 │ ├─────────┼──────────────┼─────────────┼──────────────┼─────────────┤ │ setosa │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ │ setosa │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ │ setosa │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ │ setosa │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ │ setosa │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ │ setosa │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ │ setosa │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ │ setosa │ 5.0 │ 3.4 │ 1.5 │ 0.2 │ │ setosa │ 4.4 │ 2.9 │ 1.4 │ 0.2 │ │ setosa │ 4.9 │ 3.1 │ 1.5 │ 0.1 │ │ … │ … │ … │ … │ … │ └─────────┴──────────────┴─────────────┴──────────────┴─────────────┘
filter
Filtering data involves selecting rows that meet a condition or a set of conditions. We can filter so we only have flowers with sepal_length
greater than 5.0
# Filter the dataset for flowers with sepal length greater than 5.0
filter(iris.sepal_length > 5.0) iris.
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │ │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │ │ 5.4 │ 3.7 │ 1.5 │ 0.2 │ setosa │ │ 5.8 │ 4.0 │ 1.2 │ 0.2 │ setosa │ │ 5.7 │ 4.4 │ 1.5 │ 0.4 │ setosa │ │ 5.4 │ 3.9 │ 1.3 │ 0.4 │ setosa │ │ 5.1 │ 3.5 │ 1.4 │ 0.3 │ setosa │ │ 5.7 │ 3.8 │ 1.7 │ 0.3 │ setosa │ │ 5.1 │ 3.8 │ 1.5 │ 0.3 │ setosa │ │ 5.4 │ 3.4 │ 1.7 │ 0.2 │ setosa │ │ … │ … │ … │ … │ … │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
Additionally, or filter for Setosa flowers with sepal length greater than 5.0:
# Filter the dataset for flowers with sepal length greater than 5.0 that are setosa
filter((iris.sepal_length > 5.0) & (iris.species == "setosa")) iris.
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │ │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │ │ 5.4 │ 3.7 │ 1.5 │ 0.2 │ setosa │ │ 5.8 │ 4.0 │ 1.2 │ 0.2 │ setosa │ │ 5.7 │ 4.4 │ 1.5 │ 0.4 │ setosa │ │ 5.4 │ 3.9 │ 1.3 │ 0.4 │ setosa │ │ 5.1 │ 3.5 │ 1.4 │ 0.3 │ setosa │ │ 5.7 │ 3.8 │ 1.7 │ 0.3 │ setosa │ │ 5.1 │ 3.8 │ 1.5 │ 0.3 │ setosa │ │ 5.4 │ 3.4 │ 1.7 │ 0.2 │ setosa │ │ … │ … │ … │ … │ … │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
order_by
Ordering data involves sorting the rows of the DataFrame based on the values in one or more columns. This can be achieved with order_by
because it arranges the values of one or more columns in ascending or descending order.
# Sort the dataset by 'sepal_length' in ascending order
iris.order_by(iris.sepal_length)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 4.3 │ 3.0 │ 1.1 │ 0.1 │ setosa │ │ 4.4 │ 3.2 │ 1.3 │ 0.2 │ setosa │ │ 4.4 │ 3.0 │ 1.3 │ 0.2 │ setosa │ │ 4.4 │ 2.9 │ 1.4 │ 0.2 │ setosa │ │ 4.5 │ 2.3 │ 1.3 │ 0.3 │ setosa │ │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │ │ 4.6 │ 3.2 │ 1.4 │ 0.2 │ setosa │ │ 4.6 │ 3.6 │ 1.0 │ 0.2 │ setosa │ │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ setosa │ │ 4.7 │ 3.2 │ 1.6 │ 0.2 │ setosa │ │ … │ … │ … │ … │ … │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
You can sort in descending order using the desc
method of a column:
# Sort the dataset by 'sepal_length' in descending order
iris.order_by(iris.sepal_length.desc())
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤ │ 7.9 │ 3.8 │ 6.4 │ 2.0 │ virginica │ │ 7.7 │ 2.6 │ 6.9 │ 2.3 │ virginica │ │ 7.7 │ 2.8 │ 6.7 │ 2.0 │ virginica │ │ 7.7 │ 3.8 │ 6.7 │ 2.2 │ virginica │ │ 7.7 │ 3.0 │ 6.1 │ 2.3 │ virginica │ │ 7.6 │ 3.0 │ 6.6 │ 2.1 │ virginica │ │ 7.4 │ 2.8 │ 6.1 │ 1.9 │ virginica │ │ 7.3 │ 2.9 │ 6.3 │ 1.8 │ virginica │ │ 7.2 │ 3.6 │ 6.1 │ 2.5 │ virginica │ │ 7.2 │ 3.0 │ 5.8 │ 1.6 │ virginica │ │ … │ … │ … │ … │ … │ └──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
aggregates
LETSQL has the same aggregate functions as Ibis to help summarize data.
mean
, max
, min
, count
, sum
(the list goes on).
To aggregate an entire column, call the corresponding method on that column.
iris.sepal_length.mean()
┌──────────┐
│ 5.843333 │
└──────────┘
You can compute multiple aggregates at once using the aggregate
method:
max()]) iris.aggregate([iris.sepal_length.mean(), iris.petal_width.
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ Mean(sepal_length) ┃ Max(petal_width) ┃ ┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ float64 │ float64 │ ├────────────────────┼──────────────────┤ │ 5.843333 │ 2.5 │ └────────────────────┴──────────────────┘
But aggregate
really shines when it’s paired with group_by
.
group_by
Grouping data involves splitting the data into groups based on some criteria. For creating the groupings of rows that have the same value for one or more columns use group_by
.
To compute summary statistics for each group pair it with aggregate
to get a result.
# Group the dataset by the 'species' column
'species').aggregate() iris.group_by(
┏━━━━━━━━━━━━┓ ┃ species ┃ ┡━━━━━━━━━━━━┩ │ string │ ├────────────┤ │ versicolor │ │ virginica │ │ setosa │ └────────────┘
We grouped by the species
column and handed it an “empty” aggregate command. The result of that is a column of the unique values in the species
column.
Now, if we add an aggregation function to that, we start to really open things up.
"species").aggregate(iris.sepal_length.mean()) iris.group_by(
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓ ┃ species ┃ Mean(sepal_length) ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩ │ string │ float64 │ ├────────────┼────────────────────┤ │ versicolor │ 5.936 │ │ virginica │ 6.588 │ │ setosa │ 5.006 │ └────────────┴────────────────────┘
Conclusion
In this tutorial, we covered the basics of selecting, filtering, grouping, aggregating, and ordering data using the Iris dataset in LETSQL. Explore further by applying these techniques to your datasets and combining them to perform more complex data manipulations.