September 20th, 2021

Introducing DatatableTon – Python Datatable Tutorials & Exercises

Category: datatable, Open Source, Python, Tutorials

Datatable is a Python library for manipulating tabular data. It supports out-of-memory datasets and multi-threaded data processing, and it has a flexible API.

If this reminds you of R’s data.table, you are spot on because Python’s datatable package is closely related to and inspired by the R library.

Version 1.0.0 was released on 1st July 2021, which makes this a good time to begin exploring the package.

Notebooks are one of the best ways to learn about packages and deep-dive into them. They are convenient, enable hands-on experience and often go hand-in-hand with crisp documentation.

DatatableTon: 💯 datatable exercises

DatatableTon is an open-source project consisting of 100 Python datatable exercises. The exercises are split into sections and structured as a course, or as a set of tutorials, for beginners, intermediates and experts to teach and learn with.

✅ Structured as exercises & tutorials – Choose your style
✅ Suitable for beginners, intermediates & experts – Choose your level
✅ Available on Colab, Kaggle, Binder & GitHub – Choose your platform


  • For beginners looking to learn datatable from scratch, it is recommended to go through all the sets from the beginning and in order. They are structured to make it easy for newcomers to get started and learn quickly.
  • For intermediates looking to up-skill themselves on datatable, it is recommended to start with Set 5 or Set 6 and go through all the subsequent sets in order.
  • For experts looking to practise more of datatable, it is recommended to test yourself on the last two sets: Set 9 and Set 10.


  • For teachers looking at exercises to test students on, it is recommended to use all the Exercises style of the sets.
  • For teachers looking at tutorials to present or teach with, it is recommended to use all the Solutions style of the sets.


Each section of DatatableTon is a Jupyter Notebook designed to showcase a specific capability of the package ranging from basic setup and data processing to machine learning models and complete projects.

Set 01 • Datatable Introduction • Beginner • Exercises 1–10

  • Installation and setup of the package
  • Creating and displaying data
  • Viewing data and its details
import datatable as dt

data = dt.Frame(v1=range(10), v2=['Y', 'O', 'U', 'C', 'A', 'N', 'D', 'O', 'I', 'T'])

Set 02 • Files and Formats • Beginner • Exercises 11–20

  • Reading/Writing csv, gz, jay, zip files or urls
  • Integrating pandas, numpy, arrow formats
  • Using lists, dicts and tuples with frames
import datatable as dt
import pandas as pd

dframe = pd.DataFrame({'v1': range(11), 
                       'v2': ['N', 'E', 'V', 'E', 'R', 'G', 'I', 'V', 'E', 'U', 'P']})
data_pd = dt.Frame(dframe)
pd_data = data_pd.to_pandas()

Set 03 • Data Selection • Beginner • Exercises 21–30

  • Select row(s)/column(s)/slice(s)/element(s)
  • Filter row(s)/column(s) using single or multiple heuristics
  • Remove missing row(s)/column(s) and drop duplicates
import datatable as dt

data = dt.fread('datatableton_sample.csv')
data_upq = data[:, ['user', 'product', 'quantity']]

Set 04 • Frame Operations • Beginner • Exercises 31–40

  • Change column names and types
  • Create, update, delete row(s)/column(s)
  • Impute and set missing values
import datatable as dt

data = dt.fread('titanic.csv')
data.replace('?', None)

Set 05 • Column Aggregations • Beginner • Exercises 41–50

  • Calculate count, sum, min, max, mean, median, mode, sd, skew, kurt
  • Covariance of columns
  • Feature correlations and correlation matrix
import datatable as dt
from sklearn.datasets import load_wine

data = dt.Frame(load_wine(as_frame=True).frame)

Set 06 • Grouping Methods • Intermediate • Exercises 51–60

  • Aggregating metrics grouped by features
  • Comparing column statistics grouped by features
  • Combining groupings with filtering and sorting
import datatable as dt
from seaborn import load_dataset

data = dt.Frame(load_dataset('penguins'))
data.replace('NA', None)
data[:, dt.median(dt.f.body_mass_g), dt.by(dt.f.species)]

Set 07 • Multiple Frames • Intermediate • Exercises 61–70

  • Read, rbind, cbind multiple frames
  • Join frames using single or multiple keys
  • Union, intersection, difference of frames
import datatable as dt

data = list(dt.iread(''))

orders_jan = data[1]
orders_feb = data[0]
orders_mar = data[2]
orders_all = dt.rbind(orders_jan, orders_feb, orders_mar)
returns = data[3]

orders_all.key = 'Order ID'
sales = returns[:, :, dt.join(orders_all)]

Set 08 • Time Series • Intermediate • Exercises 71–80

  • Extracting and creating date/time features
  • Creating lag and lead variables within/without groups
  • Calculating difference of dates/timestamps
import datatable as dt

data = dt.fread('datatableton_sample.csv')
data['previous_timestamp'] = dt.shift(dt.f.timestamp, n=1)

Set 09 • Native FTRL • Expert • Exercises 81–90

  • Initialization and hyperparameters of FTRL model
  • Training and scoring an FTRL model
  • Performing k-fold cross-validation
import datatable as dt
from datatable.models import Ftrl

data = dt.fread('kdd_ctr.csv', fill=True)[1:,:]
target = data['click']
del data['click']

model_ftrl = Ftrl()
model_ftrl.fit(data, target)

Set 10 • Capstone Projects • Expert • Exercises 91–100

  • End-to-end workflow on multiple datasets
  • Kaggle competition datasets and actual submissions
  • Explore your own datasets and use-cases
import datatable as dt
from datatable.models import Ftrl

train = dt.fread('tradeshift-text-classification/train.csv.gz')
test = dt.fread('tradeshift-text-classification/test.csv.gz')
train_labels = dt.fread('tradeshift-text-classification/trainLabels.csv.gz')

test_ids = test['id']

del train['id']
del test['id']

submission = dt.Frame()

for target in train_labels.names[1:]:
    print(f'Model for target {target}')

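    model_ftrl = Ftrl(nepochs=5, nbins=10**8, lambda1=0.1)
    model_ftrl.fit(train, train_labels[target])

    preds_test = model_ftrl.predict(test)
    submission_target = dt.Frame(id_label=test_ids[:, dt.as_type(dt.f.id, str) + f'_{target}'], pred=preds_test['True'])
    # Append this target's predictions to the overall submission
    submission.rbind(submission_target)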



DatatableTon is open-source and freely available on GitHub. Special thanks to Parul Pandey & Shrinidhi Narasimhan for collaborating 🙏

This article was first published on Medium.

About the Author

Rohan Rao

I'm a Machine Learning Engineer and Kaggle Quadruple Grandmaster with over 7 years of experience building data science products in various industries and projects like digital payments, e-commerce retail, credit risk, fraud prevention, growth, logistics, and more. I enjoy working on competitions and hackathons, and collaborating with folks around the globe on building solutions. I completed my post-graduation in Applied Statistics from IIT-Bombay in 2013. Solving sudokus and puzzles has been my big hobby for over a decade. Having won the national championship multiple times, I've represented India, finished in the top 10 in the world, and won a silver medal at the Asian Championships. My dream is to make 'Person of Interest' a reality. You can find me on LinkedIn and follow me on Twitter.
