Tabular Data
xskillscore can be used on tabular data such as that stored in a pandas.DataFrame.
It can be used most effectively when evaluating predictions over different fields.
[1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
import xskillscore as xs
np.random.seed(seed=42)
California house prices dataset
A small example is to take a dataset and evaluate the model's predictions grouped by a single field (column).
Load the California house prices dataset:
[2]:
housing = fetch_california_housing(as_frame=True)
df = housing.frame
df["AveRooms"] = df["AveRooms"].round()
df = df.rename(columns={"MedHouseVal": "y"})
df
[2]:
| MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | y
---|---|---|---|---|---|---|---|---|---
0 | 8.3252 | 41.0 | 7.0 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
1 | 8.3014 | 21.0 | 6.0 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
2 | 7.2574 | 52.0 | 8.0 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
3 | 5.6431 | 52.0 | 6.0 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
4 | 3.8462 | 52.0 | 6.0 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20635 | 1.5603 | 25.0 | 5.0 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
20636 | 2.5568 | 18.0 | 6.0 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
20637 | 1.7000 | 17.0 | 5.0 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
20638 | 1.8672 | 18.0 | 5.0 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |
20639 | 2.3886 | 16.0 | 5.0 | 1.162264 | 1387.0 | 2.616981 | 39.37 | -121.24 | 0.894 |
20640 rows × 9 columns
Create a dummy prediction column by adding noise to y:
[3]:
noise = np.random.uniform(-1, 1, size=len(df["y"]))
df["yhat"] = (df["y"] + (df["y"] * noise)).clip(lower=df["y"].min())
Evaluate the model over the field AveRooms using pandas.groupby.apply with mean_squared_error from scikit-learn:
[4]:
df.groupby("AveRooms").apply(lambda x: mean_squared_error(x["y"], x["yhat"])).head()
[4]:
AveRooms
1.0 1.789466
2.0 1.827004
3.0 1.492455
4.0 1.352848
5.0 1.384756
dtype: float64
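For reference, each per-group score above is just the mean of the squared residuals within that group. A pandas-only sketch with a toy frame (hypothetical values) shows the same computation:

```python
import pandas as pd

# Toy frame with a grouping field and paired truth/prediction columns
# (hypothetical values, mirroring the y/yhat layout above)
df_toy = pd.DataFrame(
    {
        "AveRooms": [1.0, 1.0, 2.0, 2.0],
        "y": [1.0, 2.0, 3.0, 4.0],
        "yhat": [1.5, 2.5, 3.0, 5.0],
    }
)

# Each group's score is the mean of squared residuals within the group
mse_by_group = df_toy.groupby("AveRooms").apply(
    lambda g: ((g["y"] - g["yhat"]) ** 2).mean()
)
print(mse_by_group)
# AveRooms
# 1.0    0.25
# 2.0    0.50
# dtype: float64
```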
You could also do the following using xskillscore. First, structure the pandas.DataFrame to keep the core fields when converting to an xarray object:
[5]:
min_df = df.reset_index().set_index(["index", "AveRooms"])[["y", "yhat"]]
min_df
[5]:
index | AveRooms | y | yhat
---|---|---|---
0 | 7.0 | 4.526 | 3.390337 |
1 | 6.0 | 3.585 | 6.816622 |
2 | 8.0 | 3.521 | 5.154701 |
3 | 6.0 | 3.413 | 4.086443 |
4 | 6.0 | 3.422 | 1.067792 |
... | ... | ... | ... |
20635 | 5.0 | 0.781 | 0.611083 |
20636 | 6.0 | 0.771 | 1.497737 |
20637 | 5.0 | 0.923 | 0.648200 |
20638 | 5.0 | 0.847 | 1.470100 |
20639 | 5.0 | 0.894 | 0.166662 |
20640 rows × 2 columns
Convert it to an xarray.Dataset using pandas.DataFrame.to_xarray. Note: this will create an array of index by AveRooms and pad the combinations that do not exist with nan.
[6]:
ds = min_df.to_xarray()
ds
[6]:
<xarray.Dataset>
Dimensions:   (index: 20640, AveRooms: 47)
Coordinates:
  * index     (index) int64 0 1 2 3 4 5 ... 20634 20635 20636 20637 20638 20639
  * AveRooms  (AveRooms) float64 1.0 2.0 3.0 4.0 5.0 ... 60.0 62.0 133.0 142.0
Data variables:
    y         (index, AveRooms) float64 nan nan nan nan nan ... nan nan nan nan
    yhat      (index, AveRooms) float64 nan nan nan nan nan ... nan nan nan nan
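To see the padding concretely, here is a minimal sketch with a two-row frame (hypothetical column names idx/grp):

```python
import pandas as pd

# Two rows whose (idx, grp) pairs only cover the diagonal of the 2x2 grid
# (hypothetical column names; the notebook above uses index/AveRooms)
small = pd.DataFrame({"idx": [0, 1], "grp": [1.0, 2.0], "y": [10.0, 20.0]})
ds_small = small.set_index(["idx", "grp"]).to_xarray()

# The combinations (idx=0, grp=2.0) and (idx=1, grp=1.0) never occur in the
# frame, so to_xarray pads those cells with nan
print(ds_small["y"].values)
# [[10. nan]
#  [nan 20.]]
```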
You can now apply any metric from xskillscore using the accessor method. The input for the dim argument is index, as we want to reduce this dimension and apply the metric over AveRooms. In addition, there are nan's in the xarray.Dataset, so you should use skipna=True:
[7]:
out = ds.xs.mse("y", "yhat", dim="index", skipna=True)
out
[7]:
<xarray.DataArray (AveRooms: 47)>
array([1.78946554e+00, 1.82700352e+00, 1.49245536e+00, 1.35284843e+00,
       1.38475581e+00, 1.78975720e+00, 3.26525395e+00, 4.80747797e+00,
       5.16524336e+00, 3.13496890e+00, 1.23401728e+00, 1.12562885e+00,
       5.54888374e-01, 2.61824323e+00, 8.84529997e-01, 1.17865387e+00,
       8.96786588e-01, 6.93484341e-01, 8.44837355e-01, 9.50615751e-01,
       2.55912220e+00, 4.16548298e-01, 3.07284580e-01, 8.31537279e-01,
       4.06466713e+00, 8.79983025e-01, 1.09491040e-02, 1.12379707e+00,
       1.50188148e+00, 1.56069394e+00, 2.73330025e-02, 2.68438951e-01,
       4.63967683e-01, 1.47081770e+00, 3.28568563e+00, 4.86835859e-01,
       5.48064237e-04, 1.40563208e+00, 9.04093610e-01, 3.26459003e-01,
       1.48460982e-01, 3.39427104e+00, 4.19379397e+00, 1.74130396e-01,
       1.04411235e+00, 1.23495233e+00, 2.64087781e-01])
Coordinates:
  * AveRooms  (AveRooms) float64 1.0 2.0 3.0 4.0 5.0 ... 60.0 62.0 133.0 142.0
It makes sense to return the data in tabular form, so you can call xarray.DataArray.to_series to convert it to a pandas.Series:
[8]:
out.to_series().head()
[8]:
AveRooms
1.0 1.789466
2.0 1.827004
3.0 1.492455
4.0 1.352848
5.0 1.384756
dtype: float64
Evaluating predictions over many columns
xskillscore is built upon xarray.apply_ufunc, which offers a speed-up by vectorizing operations. As a result, xskillscore can be faster than pandas.groupby.apply. This is especially true if there are many samples in the dataset and if the predictions have to be evaluated over many fields.
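The vectorization that xarray.apply_ufunc provides can be sketched directly. A toy MSE reduction over one dimension (hypothetical data) illustrates the pattern xskillscore builds on:

```python
import numpy as np
import xarray as xr

da_y = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("index", "grp"))
da_yhat = da_y + 1.0

# apply_ufunc moves the core dim ("index") to the last axis and runs one
# vectorized NumPy reduction instead of a Python-level loop over groups
mse = xr.apply_ufunc(
    lambda a, b: ((a - b) ** 2).mean(axis=-1),
    da_y,
    da_yhat,
    input_core_dims=[["index"], ["index"]],
)
print(mse.values)  # [1. 1. 1.]
```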
For this exercise we will create fake data for which the predictions have to be evaluated over three fields:
[9]:
stores = np.arange(100)
skus = np.arange(100)
dates = pd.date_range("1/1/2020", "1/10/2020", freq="D")
rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "y": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)
noise = np.random.uniform(-1, 1, size=len(df["y"]))
df["yhat"] = (df["y"] + (df["y"] * noise)).clip(lower=df["y"].min())
df
[9]:
| DATE | STORE | SKU | y | yhat
---|---|---|---|---|---
0 | 2020-01-01 | 0 | 0 | 6 | 3.874272 |
1 | 2020-01-01 | 0 | 1 | 9 | 13.551266 |
2 | 2020-01-01 | 0 | 2 | 8 | 3.979884 |
3 | 2020-01-01 | 0 | 3 | 3 | 3.222543 |
4 | 2020-01-01 | 0 | 4 | 6 | 1.647346 |
... | ... | ... | ... | ... | ... |
99995 | 2020-01-10 | 99 | 95 | 1 | 1.000000 |
99996 | 2020-01-10 | 99 | 96 | 4 | 2.770135 |
99997 | 2020-01-10 | 99 | 97 | 7 | 5.820397 |
99998 | 2020-01-10 | 99 | 98 | 2 | 1.000000 |
99999 | 2020-01-10 | 99 | 99 | 2 | 1.000000 |
100000 rows × 5 columns
Time the pandas.groupby.apply method:
[10]:
%%time
df.groupby(["STORE", "SKU"]).apply(lambda x: mean_squared_error(x["y"], x["yhat"]))
CPU times: user 3.22 s, sys: 0 ns, total: 3.22 s
Wall time: 3.22 s
[10]:
STORE SKU
0 0 10.968313
1 5.465377
2 2.546790
3 4.274809
4 8.443736
...
99 95 6.832711
96 4.262613
97 11.533266
98 14.450065
99 2.820765
Length: 10000, dtype: float64
Time it using xskillscore:
[11]:
%%time
df.set_index(["DATE", "STORE", "SKU"]).to_xarray().xs.mse(
"y", "yhat", dim="DATE"
).to_series()
CPU times: user 20.2 ms, sys: 0 ns, total: 20.2 ms
Wall time: 19.6 ms
[11]:
STORE SKU
0 0 10.968313
1 5.465377
2 2.546790
3 4.274809
4 8.443736
...
99 95 6.832711
96 4.262613
97 11.533266
98 14.450065
99 2.820765
Length: 10000, dtype: float64
See xskillscore-tutorial for further reading.