Tutorial

This is a quick tutorial of some basic commands and usage patterns of LETSQL. We’ll cover selecting data, filtering, grouping, aggregating, and ordering.

Prerequisite

Please start by following the installation instructions, and then install pandas.

Loading the dataset

LETSQL can work with several file types, but at its core is an Ibis backend so it connects to existing databases and interacts with the data there.

We’ll use the iris dataset, which is a classic dataset used in machine learning and statistics. It contains measurements of four features (sepal length, sepal width, petal length, petal width) for 150 flowers from three species (setosa, versicolor, virginica).

import pandas as pd

import letsql as ls

con = ls.connect()

# Load the dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)

# Register it
iris = con.register(df, "iris")

You can now see the example dataset copied over to the database

con.list_tables()
['iris']

There’s one table, called iris. We can interact with it through the object that was returned when registering the table. Or we can use the table method to get it.

iris = con.table("iris")

LETSQL is an Ibis Backend so it is lazily evaluated, so instead of seeing the data, we see the schema of the table, instead. To peek at the data, we can call head and then execute to get the first few rows of the table.

iris.head().execute()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Interactive mode

For the rest of this intro, we’ll turn on interactive mode, which partially executes queries to give users a preview of the results. There is a small difference in the way the output is formatted, but otherwise this is the same as calling to_pandas on the table expression with a limit of 10 result rows returned.

import ibis

ibis.options.interactive = True
iris.head()
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ sepal_length  sepal_width  petal_length  petal_width  species ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩
│ float64float64float64float64string  │
├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤
│          5.13.51.40.2setosa  │
│          4.93.01.40.2setosa  │
│          4.73.21.30.2setosa  │
│          4.63.11.50.2setosa  │
│          5.03.61.40.2setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

Common operations

LETSQL has a collection of useful table methods to manipulate and query the data in a table (or tables).

select

Selecting data involves choosing specific columns from the table. To select a column you can use the name of the column as a string:

# Select the 'sepal_length' and 'sepal_width' columns
iris.select("sepal_length", "sepal_width")
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ sepal_length  sepal_width ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ float64float64     │
├──────────────┼─────────────┤
│          5.13.5 │
│          4.93.0 │
│          4.73.2 │
│          4.63.1 │
│          5.03.6 │
│          5.43.9 │
│          4.63.4 │
│          5.03.4 │
│          4.42.9 │
│          4.93.1 │
│             │
└──────────────┴─────────────┘

selectors

Typing out ALL of the column names except one is a little annoying. Instead of doing that again, we can use a selector to quickly select or deselect groups of columns.

import ibis.selectors as s

iris.select(
    ~s.matches("sepal_length")
    # match every column except `sepal_length`
)
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ sepal_width  petal_length  petal_width  species ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩
│ float64float64float64string  │
├─────────────┼──────────────┼─────────────┼─────────┤
│         3.51.40.2setosa  │
│         3.01.40.2setosa  │
│         3.21.30.2setosa  │
│         3.11.50.2setosa  │
│         3.61.40.2setosa  │
│         3.91.70.4setosa  │
│         3.41.40.3setosa  │
│         3.41.50.2setosa  │
│         2.91.40.2setosa  │
│         3.11.50.1setosa  │
│                  │
└─────────────┴──────────────┴─────────────┴─────────┘

You can also use a selector alongside a column name.

iris.select("species", s.numeric())
┏━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ species  sepal_length  sepal_width  petal_length  petal_width ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ stringfloat64float64float64float64     │
├─────────┼──────────────┼─────────────┼──────────────┼─────────────┤
│ setosa 5.13.51.40.2 │
│ setosa 4.93.01.40.2 │
│ setosa 4.73.21.30.2 │
│ setosa 4.63.11.50.2 │
│ setosa 5.03.61.40.2 │
│ setosa 5.43.91.70.4 │
│ setosa 4.63.41.40.3 │
│ setosa 5.03.41.50.2 │
│ setosa 4.42.91.40.2 │
│ setosa 4.93.11.50.1 │
│  │
└─────────┴──────────────┴─────────────┴──────────────┴─────────────┘

filter

Filtering data involves selecting rows that meet a condition or a set of conditions. We can filter so we only have flowers with sepal_length greater than 5.0

# Filter the dataset for flowers with sepal length greater than 5.0
iris.filter(iris.sepal_length > 5.0)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ sepal_length  sepal_width  petal_length  petal_width  species ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩
│ float64float64float64float64string  │
├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤
│          5.13.51.40.2setosa  │
│          5.43.91.70.4setosa  │
│          5.43.71.50.2setosa  │
│          5.84.01.20.2setosa  │
│          5.74.41.50.4setosa  │
│          5.43.91.30.4setosa  │
│          5.13.51.40.3setosa  │
│          5.73.81.70.3setosa  │
│          5.13.81.50.3setosa  │
│          5.43.41.70.2setosa  │
│                   │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

Additionally, or filter for Setosa flowers with sepal length greater than 5.0:

# Filter the dataset for flowers with sepal length greater than 5.0 that are setosa
iris.filter((iris.sepal_length > 5.0) & (iris.species == "setosa"))
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ sepal_length  sepal_width  petal_length  petal_width  species ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩
│ float64float64float64float64string  │
├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤
│          5.13.51.40.2setosa  │
│          5.43.91.70.4setosa  │
│          5.43.71.50.2setosa  │
│          5.84.01.20.2setosa  │
│          5.74.41.50.4setosa  │
│          5.43.91.30.4setosa  │
│          5.13.51.40.3setosa  │
│          5.73.81.70.3setosa  │
│          5.13.81.50.3setosa  │
│          5.43.41.70.2setosa  │
│                   │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

order_by

Ordering data involves sorting the rows of the DataFrame based on the values in one or more columns. This can be achieved with order_by because it arranges the values of one or more columns in ascending or descending order.

# Sort the dataset by 'sepal_length' in ascending order
iris.order_by(iris.sepal_length)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ sepal_length  sepal_width  petal_length  petal_width  species ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩
│ float64float64float64float64string  │
├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤
│          4.33.01.10.1setosa  │
│          4.43.21.30.2setosa  │
│          4.43.01.30.2setosa  │
│          4.42.91.40.2setosa  │
│          4.52.31.30.3setosa  │
│          4.63.11.50.2setosa  │
│          4.63.21.40.2setosa  │
│          4.63.61.00.2setosa  │
│          4.63.41.40.3setosa  │
│          4.73.21.60.2setosa  │
│                   │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

You can sort in descending order using the desc method of a column:

# Sort the dataset by 'sepal_length' in descending order
iris.order_by(iris.sepal_length.desc())
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ sepal_length  sepal_width  petal_length  petal_width  species   ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ float64float64float64float64string    │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│          7.93.86.42.0virginica │
│          7.72.66.92.3virginica │
│          7.72.86.72.0virginica │
│          7.73.86.72.2virginica │
│          7.73.06.12.3virginica │
│          7.63.06.62.1virginica │
│          7.42.86.11.9virginica │
│          7.32.96.31.8virginica │
│          7.23.66.12.5virginica │
│          7.23.05.81.6virginica │
│                     │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘

aggregates

LETSQL has the same aggregate functions as Ibis to help summarize data.

mean, max, min, count, sum (the list goes on).

To aggregate an entire column, call the corresponding method on that column.

iris.sepal_length.mean()

┌──────────┐
│ 5.843333 │
└──────────┘

You can compute multiple aggregates at once using the aggregate method:

iris.aggregate([iris.sepal_length.mean(), iris.petal_width.max()])
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Mean(sepal_length)  Max(petal_width) ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ float64float64          │
├────────────────────┼──────────────────┤
│           5.8433332.5 │
└────────────────────┴──────────────────┘

But aggregate really shines when it’s paired with group_by.

group_by

Grouping data involves splitting the data into groups based on some criteria. For creating the groupings of rows that have the same value for one or more columns use group_by.

To compute summary statistics for each group pair it with aggregate to get a result.

# Group the dataset by the 'species' column
iris.group_by('species').aggregate()
┏━━━━━━━━━━━━┓
┃ species    ┃
┡━━━━━━━━━━━━┩
│ string     │
├────────────┤
│ setosa     │
│ versicolor │
│ virginica  │
└────────────┘

We grouped by the species column and handed it an “empty” aggregate command. The result of that is a column of the unique values in the species column.

Now, if we add an aggregation function to that, we start to really open things up.

iris.group_by("species").aggregate(iris.sepal_length.mean())
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ species     Mean(sepal_length) ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ stringfloat64            │
├────────────┼────────────────────┤
│ versicolor5.936 │
│ virginica 6.588 │
│ setosa    5.006 │
└────────────┴────────────────────┘

Conclusion

In this tutorial, we covered the basics of selecting, filtering, grouping, aggregating, and ordering data using the Iris dataset in LETSQL. Explore further by applying these techniques to your datasets and combining them to perform more complex data manipulations.