Datatable is a Python library for manipulating tabular data. It supports out-of-memory datasets and multi-threaded data processing, and it has a flexible API.
If this reminds you of R’s data.table, you are spot on: Python’s datatable package is closely related to and inspired by the R library.
Version 1.0.0 was released on 1 July 2021, so it is a good time to begin exploring the package.
Notebooks are one of the best ways to learn about a package and dive deep into it. They are convenient, offer hands-on experience and often go hand-in-hand with crisp documentation.
DatatableTon is an open-source project consisting of 100 Python datatable exercises, split into sections and structured as a course or set of tutorials for beginners, intermediate users and experts alike.
Structured as exercises & tutorials – Choose your style
Suitable for beginners, intermediates & experts – Choose your level
Available on Colab, Kaggle, Binder & GitHub – Choose your platform
data[data.f.set >= mylevel]
data[data.f.style == mystyle]
Each section of DatatableTon is a Jupyter Notebook designed to showcase a specific capability of the package ranging from basic setup and data processing to machine learning models and complete projects.
import datatable as dt
data = dt.Frame(v1=range(10), v2=['Y', 'O', 'U', 'C', 'A', 'N', 'D', 'O', 'I', 'T'])
import datatable as dt
import pandas as pd
dframe = pd.DataFrame({'v1': range(11),
                       'v2': ['N', 'E', 'V', 'E', 'R', 'G', 'I', 'V', 'E', 'U', 'P']})
data_pd = dt.Frame(dframe)
pd_data = data_pd.to_pandas()
import datatable as dt
data = dt.fread('datatableton_sample.csv')
data_upq = data[:, ['user', 'product', 'quantity']]
import datatable as dt
data = dt.fread('titanic.csv')
data.replace('?', None)
import datatable as dt
from sklearn.datasets import load_wine
data = dt.Frame(load_wine(as_frame=True).frame)
data.mean()
import datatable as dt
from seaborn import load_dataset
data = dt.Frame(load_dataset('penguins'))
data.replace('NA', None)
data[:, dt.median(dt.f.body_mass_g), dt.by([dt.f.species, dt.f.sex])]
import datatable as dt
data = list(dt.iread('datatableton_sample.zip'))
orders_jan = data[1]
orders_feb = data[0]
orders_mar = data[2]
orders_all = dt.rbind(orders_jan, orders_feb, orders_mar)
returns = data[3]
orders_all.key = 'Order ID'
sales = returns[:, :, dt.join(orders_all)]
import datatable as dt
data = dt.fread('datatableton_sample.csv')
data['previous_timestamp'] = dt.shift(dt.f.timestamp, n=1)
import datatable as dt
from datatable.models import Ftrl
data = dt.fread('kdd_ctr.csv', fill=True)[1:,:]
target = data['click']
del data['click']
model_ftrl = Ftrl()
model_ftrl.fit(data, target)
import datatable as dt
from datatable.models import Ftrl
train = dt.fread('tradeshift-text-classification/train.csv.gz')
test = dt.fread('tradeshift-text-classification/test.csv.gz')
train_labels = dt.fread('tradeshift-text-classification/trainLabels.csv.gz')
test_ids = test['id']
del train['id']
del test['id']
submission = dt.Frame()
for target in train_labels.names[1:]:
    print(f'Model for target {target}')
    model_ftrl = Ftrl(nepochs=5, nbins=10**8, lambda1=0.1)
    model_ftrl.fit(train, train_labels[target])
    preds_test = model_ftrl.predict(test)
    submission_target = dt.Frame(id_label=test_ids[:, dt.as_type(dt.f.id, str) + f'_{target}'], pred=preds_test['True'])
    submission.rbind(submission_target)
submission.to_csv('submission.csv')
DatatableTon is open-source and freely available on GitHub. Special thanks to Parul Pandey & Shrinidhi Narasimhan for collaborating.
This article was first published on Medium.