Tabular Data
xskillscore can be used on tabular data such as that stored in a pandas.DataFrame.
It can be used most effectively when evaluating predictions over different fields.
[1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
import xskillscore as xs
np.random.seed(seed=42)
California house prices dataset
A small example is to take a dataset and evaluate the model's predictions grouped by a single field (column).
Load the California house prices dataset:
[2]:
housing = fetch_california_housing(as_frame=True)
df = housing.frame
df["AveRooms"] = df["AveRooms"].round()
df = df.rename(columns={"MedHouseVal": "y"})
df
[2]:
| MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | y
---|---|---|---|---|---|---|---|---|---
0 | 8.3252 | 41.0 | 7.0 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
1 | 8.3014 | 21.0 | 6.0 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
2 | 7.2574 | 52.0 | 8.0 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
3 | 5.6431 | 52.0 | 6.0 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
4 | 3.8462 | 52.0 | 6.0 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20635 | 1.5603 | 25.0 | 5.0 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
20636 | 2.5568 | 18.0 | 6.0 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
20637 | 1.7000 | 17.0 | 5.0 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
20638 | 1.8672 | 18.0 | 5.0 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |
20639 | 2.3886 | 16.0 | 5.0 | 1.162264 | 1387.0 | 2.616981 | 39.37 | -121.24 | 0.894 |
20640 rows × 9 columns
Create a dummy prediction column by adding noise to y:
[3]:
noise = np.random.uniform(-1, 1, size=len(df["y"]))
df["yhat"] = (df["y"] + (df["y"] * noise)).clip(lower=df["y"].min())
Evaluate the model over the field AveRooms using pandas.groupby.apply with mean_squared_error from scikit-learn:
[4]:
df.groupby("AveRooms").apply(lambda x: mean_squared_error(x["y"], x["yhat"])).head()
[4]:
AveRooms
1.0 1.789466
2.0 1.827004
3.0 1.492455
4.0 1.352848
5.0 1.384756
dtype: float64
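For reference, each per-group score above is just the mean of the squared residuals within that group. A pandas-only sketch with a toy frame (hypothetical values) shows the same computation:

```python
import pandas as pd

# Toy frame with a grouping field and paired truth/prediction columns
# (hypothetical values, mirroring the y/yhat layout above)
df_toy = pd.DataFrame(
    {
        "AveRooms": [1.0, 1.0, 2.0, 2.0],
        "y": [1.0, 2.0, 3.0, 4.0],
        "yhat": [1.5, 2.5, 3.0, 5.0],
    }
)

# Each group's score is the mean of squared residuals within the group
mse_by_group = df_toy.groupby("AveRooms").apply(
    lambda g: ((g["y"] - g["yhat"]) ** 2).mean()
)
print(mse_by_group)
# AveRooms
# 1.0    0.25
# 2.0    0.50
# dtype: float64
```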
You could also do the following using xskillscore. First, structure the pandas.DataFrame to keep the core fields when converting to an xarray object:
[5]:
min_df = df.reset_index().set_index(["index", "AveRooms"])[["y", "yhat"]]
min_df
[5]:
index | AveRooms | y | yhat
---|---|---|---
0 | 7.0 | 4.526 | 3.390337 |
1 | 6.0 | 3.585 | 6.816622 |
2 | 8.0 | 3.521 | 5.154701 |
3 | 6.0 | 3.413 | 4.086443 |
4 | 6.0 | 3.422 | 1.067792 |
... | ... | ... | ... |
20635 | 5.0 | 0.781 | 0.611083 |
20636 | 6.0 | 0.771 | 1.497737 |
20637 | 5.0 | 0.923 | 0.648200 |
20638 | 5.0 | 0.847 | 1.470100 |
20639 | 5.0 | 0.894 | 0.166662 |
20640 rows × 2 columns
Convert it to an xarray.Dataset using pandas.DataFrame.to_xarray. Note: this will create an array of index by AveRooms and pad the combinations that do not exist with nan.
[6]:
ds = min_df.to_xarray()
ds
[6]:
<xarray.Dataset>
Dimensions:   (index: 20640, AveRooms: 47)
Coordinates:
  * index     (index) int64 0 1 2 3 4 5 ... 20634 20635 20636 20637 20638 20639
  * AveRooms  (AveRooms) float64 1.0 2.0 3.0 4.0 5.0 ... 60.0 62.0 133.0 142.0
Data variables:
    y         (index, AveRooms) float64 nan nan nan nan nan ... nan nan nan nan
    yhat      (index, AveRooms) float64 nan nan nan nan nan ... nan nan nan nan
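To see the padding concretely, here is a minimal sketch with a two-row frame (hypothetical column names idx/grp):

```python
import pandas as pd

# Two rows whose (idx, grp) pairs only cover the diagonal of the 2x2 grid
# (hypothetical column names; the notebook above uses index/AveRooms)
small = pd.DataFrame({"idx": [0, 1], "grp": [1.0, 2.0], "y": [10.0, 20.0]})
ds_small = small.set_index(["idx", "grp"]).to_xarray()

# The combinations (idx=0, grp=2.0) and (idx=1, grp=1.0) never occur in the
# frame, so to_xarray pads those cells with nan
print(ds_small["y"].values)
# [[10. nan]
#  [nan 20.]]
```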
You can now apply any metric from xskillscore using the accessor method. The input for the dim argument is index, as we want to reduce this dimension and apply the metric over AveRooms. In addition, there are nan's in the xarray.Dataset, so you should use skipna=True:
[7]:
out = ds.xs.mse("y", "yhat", dim="index", skipna=True)
out
[7]:
<xarray.DataArray (AveRooms: 47)>
array([1.78946554e+00, 1.82700352e+00, 1.49245536e+00, 1.35284843e+00,
       1.38475581e+00, 1.78975720e+00, 3.26525395e+00, 4.80747797e+00,
       5.16524336e+00, 3.13496890e+00, 1.23401728e+00, 1.12562885e+00,
       5.54888374e-01, 2.61824323e+00, 8.84529997e-01, 1.17865387e+00,
       8.96786588e-01, 6.93484341e-01, 8.44837355e-01, 9.50615751e-01,
       2.55912220e+00, 4.16548298e-01, 3.07284580e-01, 8.31537279e-01,
       4.06466713e+00, 8.79983025e-01, 1.09491040e-02, 1.12379707e+00,
       1.50188148e+00, 1.56069394e+00, 2.73330025e-02, 2.68438951e-01,
       4.63967683e-01, 1.47081770e+00, 3.28568563e+00, 4.86835859e-01,
       5.48064237e-04, 1.40563208e+00, 9.04093610e-01, 3.26459003e-01,
       1.48460982e-01, 3.39427104e+00, 4.19379397e+00, 1.74130396e-01,
       1.04411235e+00, 1.23495233e+00, 2.64087781e-01])
Coordinates:
  * AveRooms  (AveRooms) float64 1.0 2.0 3.0 4.0 5.0 ... 60.0 62.0 133.0 142.0
It makes sense to return the data in tabular form, so you can call xarray.DataArray.to_series to convert it to a pandas.Series:
[8]:
out.to_series().head()
[8]:
AveRooms
1.0 1.789466
2.0 1.827004
3.0 1.492455
4.0 1.352848
5.0 1.384756
dtype: float64
Evaluating predictions over many columns
xskillscore is built upon xarray.apply_ufunc, which offers a speed-up by vectorizing operations. As a result, xskillscore can be faster than pandas.groupby.apply. This is especially true if there are many samples in the dataset and if the predictions have to be evaluated over many fields.
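The vectorization that xarray.apply_ufunc provides can be sketched directly. A toy MSE reduction over one dimension (hypothetical data) illustrates the pattern xskillscore builds on:

```python
import numpy as np
import xarray as xr

da_y = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("index", "grp"))
da_yhat = da_y + 1.0

# apply_ufunc moves the core dim ("index") to the last axis and runs one
# vectorized NumPy reduction instead of a Python-level loop over groups
mse = xr.apply_ufunc(
    lambda a, b: ((a - b) ** 2).mean(axis=-1),
    da_y,
    da_yhat,
    input_core_dims=[["index"], ["index"]],
)
print(mse.values)  # [1. 1. 1.]
```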
For this exercise we will create fake data for which the predictions have to be evaluated over three fields:
[9]:
stores = np.arange(100)
skus = np.arange(100)
dates = pd.date_range("1/1/2020", "1/10/2020", freq="D")
rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "y": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)
noise = np.random.uniform(-1, 1, size=len(df["y"]))
df["yhat"] = (df["y"] + (df["y"] * noise)).clip(lower=df["y"].min())
df
[9]:
| DATE | STORE | SKU | y | yhat
---|---|---|---|---|---
0 | 2020-01-01 | 0 | 0 | 6 | 3.874272 |
1 | 2020-01-01 | 0 | 1 | 9 | 13.551266 |
2 | 2020-01-01 | 0 | 2 | 8 | 3.979884 |
3 | 2020-01-01 | 0 | 3 | 3 | 3.222543 |
4 | 2020-01-01 | 0 | 4 | 6 | 1.647346 |
... | ... | ... | ... | ... | ... |
99995 | 2020-01-10 | 99 | 95 | 1 | 1.000000 |
99996 | 2020-01-10 | 99 | 96 | 4 | 2.770135 |
99997 | 2020-01-10 | 99 | 97 | 7 | 5.820397 |
99998 | 2020-01-10 | 99 | 98 | 2 | 1.000000 |
99999 | 2020-01-10 | 99 | 99 | 2 | 1.000000 |
100000 rows × 5 columns
Time the pandas.groupby.apply method:
[10]:
%%time
df.groupby(["STORE", "SKU"]).apply(lambda x: mean_squared_error(x["y"], x["yhat"]))
CPU times: user 3.22 s, sys: 0 ns, total: 3.22 s
Wall time: 3.22 s
[10]:
STORE SKU
0 0 10.968313
1 5.465377
2 2.546790
3 4.274809
4 8.443736
...
99 95 6.832711
96 4.262613
97 11.533266
98 14.450065
99 2.820765
Length: 10000, dtype: float64
Time it using xskillscore:
[11]:
%%time
df.set_index(["DATE", "STORE", "SKU"]).to_xarray().xs.mse(
"y", "yhat", dim="DATE"
).to_series()
CPU times: user 20.2 ms, sys: 0 ns, total: 20.2 ms
Wall time: 19.6 ms
[11]:
STORE SKU
0 0 10.968313
1 5.465377
2 2.546790
3 4.274809
4 8.443736
...
99 95 6.832711
96 4.262613
97 11.533266
98 14.450065
99 2.820765
Length: 10000, dtype: float64
See xskillscore-tutorial for further reading.