QuantStack
xtensor

Data structures for data science

QuantStack
Avatar picture

Johan Mabille

  • Scientific software developer, Quant developer, formerly quant developer at HSBC
  • Joined QuantStack in 2016
  • Co-author of xtensor, xsimd, xeus

The three main languages of data science

Julia Python R

all have concepts of

  • N-D arrays
  • Data frames

but with slightly different semantics and incompatible in-memory representation

2015 Berkeley Workshop

Data Structures for Data Science

Workshop

On November 9, 2015, key developers from the R and Python ecosystems gathered in Berkeley for a workshop to explore data structures across programming languages and Packages.

Their conclusions

  • Enforce common formats and protocols to communicate and store these data structures.
  • Support for these protocols accross all three languages.
  • Converge on common native C++ implementations of these data structures with bindings for multiple scripting languages.

One Tensor to Rule Them All

Ring

What would it look like?

One Tensor to Rule Them All: The Requirements

  • Be as cool as NumPy
  • Enable use of data structures from existing tensor libraries transparently
  • Enable transparent use of large tensors backed by database calls or the file system.

The One Tensor to Rule them All cannot be just one container implementation.

Instead, it has to be a flexible expression system in which any data structure can be plugged, offering the most expressive and flexible API to the users.

xtensor

The Lazy Tensor Algebra Expression System

What is xtensor?

  • A C++ template library for multi-dimensional array manipulation

    xtensor
    • Followings the idioms of the C++ STL

      (iterator pairs, clear value semantics)

    • But also an API similar to that of numpy

What is xtensor?

  • Python bindings to enable xtensor APIs on numpy arrays.

    xtensor-python
  • Julia bindings to enable xtensor APIs on Julia arrays.

    xtensor-julia
  • A cookiecutter project for authoring of Python extensions.

    xtensor-cookiecutter
  • All open-source (BSD License).

Ever heard of numpy ?

Python 3 - numpy

C++ 14 - xtensor


                                    np.array([[3, 4], [5, 6]])

                                    arr.reshape([3, 4])
                                

                                    xt::xarray<double>({{3, 4}, {5, 6}})
                                    xt::xtensor<double, 2>({{3, 4}, {5, 6}})
                                    arr.reshape({3, 4});
                                

                                    np.linspace(1.0, 10.0, 100)
                                    np.logspace(1.0, 10.0, 100)
                                    np.arange(3, 7)
                                    np.eye(4)
                                    np.zeros([3, 4])
                                    np.ones([3, 4])
                                

                                    xt::linspace<double>(1.0, 10.0, 100)
                                    xt::logspace<double>(1.0, 10.0, 100)
                                    xt::arange(3, 7)
                                    xt::eye(4)
                                    xt::zeros<double>({3, 4})
                                    xt::ones<double>({3, 4})
                                

Python 3 - numpy

C++ 14 - xtensor


                                    a[:, np.newaxis]
                                    a[:5, 1:]
                                    a[5:1:-1, :]
                                    np.broadcast(a, [4, 5, 7])
                                    np.vectorize(f)
                                    a[a > 5]
                                    a[[0, 1], [0, 0]]
                                

                                    xt::view(a, xt::all(), xt::newaxis())
                                    xt::view(a, xt::range(_, 5), xt::range(1, _))
                                    xt::view(a, xt::range(5, 1, -1), xt::all())
                                    xt::broadcast(a, {4, 5, 7})
                                    xt::vectorize(f)
                                    xt::filter(a, a > 5)
                                    xt::index_view(a, {{0, 0}, {1, 0}})
                                

                                    np.sum(a, axis=[0, 1])
                                    np.sum(a)
                                    np.prod(a, axis=1)
                                    np.prod(a)
                                    np.mean(a, axis=1)
                                    np.mean(a)
                                

                                    xt::sum(a, {0, 1})
                                    xt::sum(a)
                                    xt::prod(a, {1})
                                    xt::prod(a)
                                    xt::mean(a, {1})
                                    xt::mean(a)
                                

Python 3 - numpy

C++ 14 - xtensor


                                    np.where(a > 5, a, b)
                                    np.where(a > 5)
                                    np.any(a)
                                    np.all(a)
                                    np.logical_and(a, b)
                                    np.logical_or(a, b)
                                

                                    xt::where(a > 5, a, b)
                                    xt::where(a > 5)
                                    xt::any(a)
                                    xt::all(a)
                                    a && b
                                    a || b
                                

                                    np.absolute(a)
                                    np.exp(a)
                                    np.sqrt(a)
                                    np.cos(a)
                                    np.cosh(a)
                                    scipy.special.erf(a)
                                    np.isnan(a)
                                

                                    xt::abs(a)
                                    xt::exp(a)
                                    xt::sqrt(a)
                                    xt::cos(a)
                                    xt::cosh(a)
                                    xt::erf(a)
                                    xt::isnan(a)
                                

Lazy evaluation

res does not hold any value, nothing is evaluated ...


                                    xarray<double> a, b, c;
                                    // ... initialization of a, b and c ... 
                                    auto res = sqrt(cos(a) + sin(b)) * c;
                                

... until assigned to a container ...


                                xarray<double> d = res;
                                // or
                                xarray<double> d = sqrt(cos(a) + sin(b)) * c;
                                

... or upon access


                                    // Assuming res is a 2D tensor
                                    double d = res(4, 2);
                                

So what is res ?

magic.gif

res is an expression

expression.svg

Expression system

Client code


                                    xarray<double> a, b, c;
                                    // ... initialization of a, b and c ... 
                                    xarray<double> res = sqrt(cos(a) + sin(b)) * c;
                                

Traditional implementation


                                    xarray<double> tmp1 = cos(a);
                                    xarray<double> tmp2 = sin(b);
                                    xarray<double> tmp3 = tmp1 + tmp2;
                                    xarray<double> tmp4 = sqrt(tmp3);
                                    xarray<double> res = tmp4 * c;
                                

xtensor generated implementation


                                    xarray<double> res(broadcast_shape(a, b, c));
                                    for(size_t i = 0; i < res.shape()[0]; ++i)
                                        res(i) = sqrt(cos(a(i)) + sin(b(i))) * c(i);
                                

Expression system

Client code


                                    xarray<double> a, b, c;
                                    // ... initialization of a, b and c ... 
                                    double res = (sqrt(cos(a) + sin(b)) * c)(4, 2);
                                

Traditional implementation


                                    xarray<double> tmp1 = cos(a);
                                    xarray<double> tmp2 = sin(b);
                                    xarray<double> tmp3 = tmp1 + tmp2;
                                    xarray<double> tmp4 = sqrt(tmp3);
                                    xarray<double> tmp5 = tmp4 * c;
                                    double d = tmp5(4, 2);
                                

xtensor generated implementation


                                    double d = sqrt(cos(a(4, 2)) + sin(b(4, 2)))
                                               * c(4, 2);
                                

Broadcasting


                                    xarray<int> a = {{1,  2,  3,  4},
                                                     {5,  6,  7,  8},
                                                     {9, 10, 11, 12}};
                                    xarray<int> b = { 1, 3, 5, 7};
                                
broadcasting4.svg

                                    xarray<int> res = a + b;
                                
broadcasting5.svg

Broadcasting


                                    xarray<double> a = {1., 2., 3., 4.};
                                
broadcasting1.svg

                                    auto res = xt::broadcast(a, {3, 4});
                                
broadcasting2.svg

                                    xarray<double> a, b, c;
                                    // ... initialization of a, b, and c ...
                                    auto res = broadcast(a + b * c, {3, 4});
                                
broadcasting3.svg

Views


                                    xarray<int> a = {{1,  2,  3,  4},
                                                     {5,  6,  7,  8},
                                                     {9, 10, 11, 12}};
                                    auto res = view(a, range(0, 3, 2),
                                                       range(0, 3, 2));
                                
view1.svg

                                    a(0, 0) = 0;
                                    assert(res(0, 0) == 0);

                                    res(1, 1) = 20;
                                    assert(a(2, 2)) == 20);
                                
view2.svg

Views


                                    xarray<int> a = {{1,  2,  3,  4},
                                                     {5,  6,  7,  8},
                                                     {9, 10, 11, 12}};
                                    auto res = view(a, 1, all());
                                
view3.svg

                                    xarray<int> a = {{1,  2,  3,  4},
                                                     {5,  6,  7,  8},
                                                     {9, 10, 11, 12}};
                                    auto res = view(a, 1, range(1, _));
                                
view4.svg

Views


                                    xarray<int> a = {{21,  2,  3,  4},
                                                     { 5,  6, 27,  8},
                                                     { 9, 10, 31, 12}};
                                    auto res = indexview(a, {{0, 0}, 
                                                             {1, 2},
                                                             {2, 2}});
                                
view5.svg

                                    xarray<int> a = {{21,  2,  3,  4},
                                                     { 5,  6, 27,  8},
                                                     { 9, 10, 31, 12}};
                                    auto res = filter(a, a > 20); 
                                
view5.svg

                                    res += 5; 
                                
view6.svg

Missing values


                                xtensor_optional<int> a =
                                    {{ 1,       2        },
                                     { 3, missing<int>() }};
                                xtensor<int> b = {{ 1, 2 },
                                                  { 3, 4 }};
                            
missing_values1.svg

                                auto res = a + b;
                                auto hv = has_value(a + b);
                            
missing_values2.svg

How do I get it?

conda.svg

With conda

                        
                        conda install xtensor -c conda-forge
                        conda install xtensor-python -c conda-forge
                        
                        
debian.svg

With Debian

                        
                        apt-get install xtensor-dev
                        apt-get install xtensor-python-dev
                        
                        
cmake.svg

From source with cmake

                        
                        cmake -DCMAKE_INSTALL_PREFIX=/path/to/prefix /source/dir
                        make install
                        
                        

How do I try it without installing anything?

http://quantstack.net/xtensor

QuantStack website

A C++ notebook

Notebook Screenshot

Authoring Python extensions (1/2)

C++: Using an algorithm from the STL on a numpy array


#include <numeric>                        // Standard library import for std::accumulate
#include "pybind11/pybind11.h"            // Pybind11 import to define Python bindings
#include "xtensor/xmath.hpp"              // xtensor import for the C++ universal functions
#define FORCE_IMPORT_ARRAY                // numpy C api loading
#include "xtensor-python/pyarray.hpp"     // Numpy bindings

double sum_of_sines(xt::pyarray<double> &m)
{
    auto sines = xt::sin(m);
    // sines does not actually hold any value, which are only computed upon access
    return std::accumulate(sines.begin(), sines.end(), 0.0);
}

PYBIND11_PLUGIN(xtensor_python_test)
{
    xt::import_numpy();
    pybind11::module m("xtensor_python_test", "Test module for xtensor python bindings");

    m.def("sum_of_sines", sum_of_sines,
        "Computes the sum of the sines of the values of the input array");

    return m.ptr();
}

Authoring Python extensions (1/2)

Python: Using an algorithm from the STL on a numpy array


import numpy as np
import xtensor_python_test as xt

a = np.arange(15).reshape(3, 5)
xt.sum_of_sines(v)

Authoring Python extensions (2/2)

C++: Create a universal function from a C++ scalar function


#include "pybind11/pybind11.h"
#define FORCE_IMPORT_ARRAY
#include "xtensor-python/pyvectorize.hpp"
#include <numeric>
#include <cmath>

namespace py = pybind11;

double scalar_func(double i, double j)
{
    return std::sin(i) - std::cos(j);
}

PYBIND11_PLUGIN(xtensor_python_test)
{
    xt::import_numpy();
    py::module m("xtensor_python_test", "Test module for xtensor python bindings");

    m.def("vectorized_func", xt::pyvectorize(scalar_func), "");

    return m.ptr();
}

Authoring Python extensions (2/2)

Python: Create a universal function from a C++ scalar function


import numpy as np
import xtensor_python_test as xt

x = np.arange(15).reshape(3, 5)
y = [1, 2, 3, 4, 5]
xt.vectorized_func(x, y)

xtensor-cookiecutter

Generate your packaged xtensor extension

  • With a few examples from the documentation
  • Unit-tests
  • HTML documentation
  • A setup.py for pushing it on pypi

What's next ?

  • Bindings with Arrow
  • Bindings with R
  • Bindings with BLAS libraries

    xtensor-blas
  • SIMD and GPU acceleration
  • World domination

Credits