{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": "# Convert your Dataset", "id": "33af4caa8c4c272c" },
{ "metadata": {}, "cell_type": "markdown", "source": [ "## The \"Long Format\"\n", "The long format is the starting point for converting any dataset to our representation.\n", "In the long format, each row is a tuple:\n", "\n", "```(time_series_id, channel_id, timestamp, value, static_var_1, static_var_2, ...)```.\n", "\n", "If your dataset already contains rows in this format, you are almost good to go. Otherwise, a little preprocessing is needed." ], "id": "6d5d8a4732a08e9f" },
{ "metadata": {}, "cell_type": "markdown", "source": [ "### Case 1. (easy) Your dataset is already in the long format\n", "\n", "Let's assume for now that your dataset is already in this form. Here is a minimal working example.\n" ], "id": "c09e80de3d48230e" },
{ "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.027232Z", "start_time": "2025-05-06T16:59:01.025194Z" } }, "cell_type": "code", "source": [ "import pandas as pd\n", "import numpy as np" ], "id": "931e7eb7d971dc65", "outputs": [], "execution_count": 28 },
{ "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.055326Z", "start_time": "2025-05-06T16:59:01.043635Z" } }, "cell_type": "code", "source": [ "df = pd.DataFrame(\n", "    {\n", "        \"time_series_id\": np.random.choice([\"A\", \"B\", \"C\"], size=100),\n", "        \"channel_id\": np.random.choice([\"X\", \"Y\", \"Z\"], size=100),\n", "        \"timestamp\": pd.date_range(\"2023-01-01\", periods=100, freq=\"h\"),\n", "        \"value\": np.random.randn(100),\n", "    }\n", ")\n", "df[\"labels\"] = df[\"time_series_id\"].map(\n", "    {\"A\": 0, \"B\": 1, \"C\": 1}\n", ")  # let's say we have one label per time series\n", "df.head()" ], "id": "61f5e747cbb65bba", "outputs": [ { "data": { "text/plain": [
"   time_series_id  channel_id            timestamp      value  labels\n",
"0               B           Y  2023-01-01 00:00:00   0.105162       1\n",
"1               B           Z  2023-01-01 01:00:00  -0.573337       1\n",
"2               B           X  2023-01-01 02:00:00  -1.973967       1\n",
"3               C           Y  2023-01-01 03:00:00   0.656065       1\n",
"4               A           Y  2023-01-01 04:00:00  -0.500246       0" ], "text/html": [ "
<table>\n", "  <thead>\n", "    <tr><th></th><th>time_series_id</th><th>channel_id</th><th>timestamp</th><th>value</th><th>labels</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>B</td><td>Y</td><td>2023-01-01 00:00:00</td><td>0.105162</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>B</td><td>Z</td><td>2023-01-01 01:00:00</td><td>-0.573337</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>B</td><td>X</td><td>2023-01-01 02:00:00</td><td>-1.973967</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>C</td><td>Y</td><td>2023-01-01 03:00:00</td><td>0.656065</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>A</td><td>Y</td><td>2023-01-01 04:00:00</td><td>-0.500246</td><td>0</td></tr>\n",
"  </tbody>\n", "</table>\n", "
<xarray.DataArray (ts_id: 3, signal_id: 3, time_id: 100)> Size: 3kB\n", "<COO: shape=(3, 3, 100), dtype=float64, nnz=100, fill_value=nan>\n", "Coordinates:\n", " * time_id (time_id) <U19 8kB '2023-01-01 00:00:00' ... '2023-01-05 03:00...\n", " labels (ts_id) int64 24B 0 1 1\n", " * ts_id (ts_id) <U1 12B 'A' 'B' 'C'\n", " * signal_id (signal_id) <U1 12B 'X' 'Y' 'Z'" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 32 },
{ "metadata": {}, "cell_type": "markdown", "source": "If you are not sure whether a variable is static, or which dimension it depends on, you can check with `infer_static_columns`.", "id": "db51c5a385d8d3ff" },
{ "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.319815Z", "start_time": "2025-05-06T16:59:01.313080Z" } }, "cell_type": "code", "source": [ "from pyrregular.data_utils import infer_static_columns\n", "\n", "infer_static_columns(df, \"time_series_id\")" ], "id": "82e5019a93603feb", "outputs": [ { "data": { "text/plain": [ "['labels']" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 33 },
{ "metadata": {}, "cell_type": "markdown", "source": "The dataset can be saved to HDF5 with our custom `irr` accessor.", "id": "f77ce0d1a47faf90" },
{ "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.365207Z", "start_time": "2025-05-06T16:59:01.355981Z" } }, "cell_type": "code", "source": "da.irr.to_hdf5(\"your_dataset.h5\")", "id": "3393ee56bff7e08d", "outputs": [], "execution_count": 34 },
{ "metadata": {}, "cell_type": "markdown", "source": "It can then be loaded directly with xarray.", "id": "9647b03a64745745" },
{ "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.376158Z", "start_time": "2025-05-06T16:59:01.374106Z" } }, "cell_type": "code", "source": "import xarray as xr", "id": "b3cddf92348de55f", "outputs": [], "execution_count": 35 },
{ "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.421359Z", "start_time": "2025-05-06T16:59:01.392360Z" 
} }, "cell_type": "code", "source": [ "da2 = xr.load_dataset(\"your_dataset.h5\", engine=\"pyrregular\")\n", "da2" ], "id": "9590c1b8eaca2b0b", "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/francesco/github/irregular_ts/irregular_ts/accessor.py:9: AccessorRegistrationWarning: registration of accessor
<xarray.Dataset> Size: 11kB\n", "Dimensions: (ts_id: 3, signal_id: 3, time_id: 100)\n", "Coordinates:\n", " labels (ts_id) int32 12B 0 1 1\n", " * signal_id (signal_id) <U1 12B 'X' 'Y' 'Z'\n", " * time_id (time_id) <U19 8kB '2023-01-01 00:00:00' ... '2023-01-05 03:00...\n", " * ts_id (ts_id) <U1 12B 'A' 'B' 'C'\n", "Data variables:\n", " data (ts_id, signal_id, time_id) float64 3kB <COO: nnz=100, fill_value=nan>
<xarray.DataArray (ts_id: 10, signal_id: 2, time_id: 100)> Size: 23kB\n", "<COO: shape=(10, 2, 100), dtype=float64, nnz=720, fill_value=nan>\n", "Coordinates:\n", " * time_id (time_id) int64 800B 0 1 2 3 4 5 6 7 ... 92 93 94 95 96 97 98 99\n", " labels (ts_id) int64 80B 0 0 0 1 1 1 0 1 1 0\n", " * ts_id (ts_id) <U21 840B '0' '1' '2' '3' '4' '5' '6' '7' '8' '9'\n", " * signal_id (signal_id) <U21 168B '0' '1'\n", "Attributes:\n", " authors: Bond, James Bond" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 40 } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }