{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": "# Convert your Dataset", "id": "33af4caa8c4c272c" }, { "metadata": {}, "cell_type": "markdown", "source": [ "## The \"Long Format\"\n", "The basic format to convert any dataset to our representation is the long format.\n", "The long format is simply a tuple:\n", "\n", "```(time_series_id, channel_id, timestamp, value, static_var_1, static_var_2, ...)```.\n", "\n", "If your dataset contains rows that are in this format, you are almost good to go. Else, there will be a little bit of preprocessing to do." ], "id": "6d5d8a4732a08e9f" }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Case 1. (easy) Your dataset is already in the long format\n", "\n", "Let's assume for now your dataset is already in this form. Here is a minimal working example.\n" ], "id": "c09e80de3d48230e" }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.027232Z", "start_time": "2025-05-06T16:59:01.025194Z" } }, "cell_type": "code", "source": [ "import pandas as pd\n", "import numpy as np" ], "id": "931e7eb7d971dc65", "outputs": [], "execution_count": 28 }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.055326Z", "start_time": "2025-05-06T16:59:01.043635Z" } }, "cell_type": "code", "source": [ "df = pd.DataFrame(\n", " {\n", " \"time_series_id\": np.random.choice([\"A\", \"B\", \"C\"], size=100),\n", " \"channel_id\": np.random.choice([\"X\", \"Y\", \"Z\"], size=100),\n", " \"timestamp\": pd.date_range(\"2023-01-01\", periods=100, freq=\"H\"),\n", " \"value\": np.random.randn(100),\n", " }\n", ")\n", "df[\"labels\"] = df[\"time_series_id\"].map(\n", " {\"A\": 0, \"B\": 1, \"C\": 1}\n", ") # let's say we have labels\n", "df.head()" ], "id": "61f5e747cbb65bba", "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/kj/v66zvn217x31k6lx63lt02q40000gn/T/ipykernel_11325/3078918095.py:5: FutureWarning: 'H' is deprecated and will be removed in a future version, please use 'h' instead.\n", " \"timestamp\": pd.date_range(\"2023-01-01\", periods=100, freq=\"H\"),\n" ] }, { "data": { "text/plain": [ " time_series_id channel_id timestamp value labels\n", "0 B Y 2023-01-01 00:00:00 0.105162 1\n", "1 B Z 2023-01-01 01:00:00 -0.573337 1\n", "2 B X 2023-01-01 02:00:00 -1.973967 1\n", "3 C Y 2023-01-01 03:00:00 0.656065 1\n", "4 A Y 2023-01-01 04:00:00 -0.500246 0" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
time_series_idchannel_idtimestampvaluelabels
0BY2023-01-01 00:00:000.1051621
1BZ2023-01-01 01:00:00-0.5733371
2BX2023-01-01 02:00:00-1.9739671
3CY2023-01-01 03:00:000.6560651
4AY2023-01-01 04:00:00-0.5002460
\n", "
" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 29 }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.099697Z", "start_time": "2025-05-06T16:59:01.095092Z" } }, "cell_type": "code", "source": [ "# Let's save this dataframe to a CSV file\n", "df.to_csv(\"your_original_dataset.csv\", index=False)" ], "id": "5dae6a52bf7ec964", "outputs": [], "execution_count": 30 }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.146158Z", "start_time": "2025-05-06T16:59:01.143581Z" } }, "cell_type": "code", "source": [ "# the csv file can be converted to our format using our interface\n", "\n", "from pyrregular.io_utils import read_csv\n", "from pyrregular.reader_interface import ReaderInterface\n", "from pyrregular.accessor import IrregularAccessor\n", "\n", "\n", "class YourDataset(ReaderInterface):\n", " @staticmethod\n", " def read_original_version(verbose=False):\n", " return read_csv(\n", " filenames=\"your_original_dataset.csv\",\n", " ts_id=\"time_series_id\",\n", " time_id=\"timestamp\",\n", " signal_id=\"channel_id\",\n", " value_id=\"value\",\n", " dims={\n", " \"ts_id\": [\n", " \"labels\"\n", " ], # static variable that depends on the time series id\n", " \"signal_id\": [],\n", " \"time_id\": [],\n", " },\n", " time_index_as_datetime=False,\n", " verbose=verbose,\n", " )" ], "id": "abc3d6419c6cf571", "outputs": [], "execution_count": 31 }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.245803Z", "start_time": "2025-05-06T16:59:01.216520Z" } }, "cell_type": "code", "source": [ "da = YourDataset.read_original_version(True)\n", "da" ], "id": "579526f379348b93", "outputs": [ { "data": { "text/plain": [ "Getting dataset metadata: 0it [00:00, ?it/s]" ], "application/vnd.jupyter.widget-view+json": { "version_major": 2, "version_minor": 0, "model_id": "180e01eb4d944c2f88f8c6ef27389b6b" } }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Reading dataset: 0%| | 0/100 [00:00 Size: 3kB\n", "\n", "Coordinates:\n", " * time_id (time_id) \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray (ts_id: 3, signal_id: 3, time_id: 100)> Size: 3kB\n",
       "<COO: shape=(3, 3, 100), dtype=float64, nnz=100, fill_value=nan>\n",
       "Coordinates:\n",
       "  * time_id    (time_id) <U19 8kB '2023-01-01 00:00:00' ... '2023-01-05 03:00...\n",
       "    labels     (ts_id) int64 24B 0 1 1\n",
       "  * ts_id      (ts_id) <U1 12B 'A' 'B' 'C'\n",
       "  * signal_id  (signal_id) <U1 12B 'X' 'Y' 'Z'
" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 32 }, { "metadata": {}, "cell_type": "markdown", "source": "If you don't know if a variable is static, or to which dimension it depends from, you can check it.", "id": "db51c5a385d8d3ff" }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.319815Z", "start_time": "2025-05-06T16:59:01.313080Z" } }, "cell_type": "code", "source": [ "from pyrregular.data_utils import infer_static_columns\n", "\n", "infer_static_columns(df, \"time_series_id\")" ], "id": "82e5019a93603feb", "outputs": [ { "data": { "text/plain": [ "['labels']" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 33 }, { "metadata": {}, "cell_type": "markdown", "source": "The dataset can be saved with our custom accessor", "id": "f77ce0d1a47faf90" }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.365207Z", "start_time": "2025-05-06T16:59:01.355981Z" } }, "cell_type": "code", "source": "da.irr.to_hdf5(\"your_dataset.h5\")", "id": "3393ee56bff7e08d", "outputs": [], "execution_count": 34 }, { "metadata": {}, "cell_type": "markdown", "source": "And then loaded directly with xarray", "id": "9647b03a64745745" }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.376158Z", "start_time": "2025-05-06T16:59:01.374106Z" } }, "cell_type": "code", "source": "import xarray as xr", "id": "b3cddf92348de55f", "outputs": [], "execution_count": 35 }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.421359Z", "start_time": "2025-05-06T16:59:01.392360Z" } }, "cell_type": "code", "source": [ "da2 = xr.load_dataset(\"your_dataset.h5\", engine=\"pyrregular\")\n", "da2" ], "id": "9590c1b8eaca2b0b", "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/francesco/github/irregular_ts/irregular_ts/accessor.py:9: AccessorRegistrationWarning: registration of accessor under name 'irr' for type is overriding a preexisting attribute with the same name.\n", " @xr.register_dataarray_accessor(\"irr\")\n" ] }, { "data": { "text/plain": [ " Size: 11kB\n", "Dimensions: (ts_id: 3, signal_id: 3, time_id: 100)\n", "Coordinates:\n", " labels (ts_id) int32 12B 0 1 1\n", " * signal_id (signal_id) " ], "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 11kB\n",
       "Dimensions:    (ts_id: 3, signal_id: 3, time_id: 100)\n",
       "Coordinates:\n",
       "    labels     (ts_id) int32 12B 0 1 1\n",
       "  * signal_id  (signal_id) <U1 12B 'X' 'Y' 'Z'\n",
       "  * time_id    (time_id) <U19 8kB '2023-01-01 00:00:00' ... '2023-01-05 03:00...\n",
       "  * ts_id      (ts_id) <U1 12B 'A' 'B' 'C'\n",
       "Data variables:\n",
       "    data       (ts_id, signal_id, time_id) float64 3kB <COO: nnz=100, fill_value=nan>
" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 36 }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Case 2. Your dataset is not in the long format\n", "Let's say you have a 3d numpy array, containing the time series, and a numpy array containing only the labels." ], "id": "ce74a62270e98721" }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.479570Z", "start_time": "2025-05-06T16:59:01.475252Z" } }, "cell_type": "code", "source": [ "import numpy as np\n", "\n", "shape = (10, 2, 100) # 10 time series, 2 channels, 100 timestamps\n", "data = np.full(shape, np.nan)\n", "mask = np.random.rand(*shape) < 0.35\n", "data[mask] = np.random.randn(mask.sum())\n", "labels = np.random.randint(0, 2, shape[0])\n", "\n", "np.save(\"your_more_complex_dataset.npy\", data)\n", "np.save(\"your_more_complex_dataset_labels.npy\", labels)\n", "\n", "data.shape, labels.shape" ], "id": "18c4936d864c29d4", "outputs": [ { "data": { "text/plain": [ "((10, 2, 100), (10,))" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 37 }, { "metadata": {}, "cell_type": "markdown", "source": "You need only a function that takes the data and the labels, and returns a dataframe in the long format, yielding it row by row.", "id": "cdf6a98deab08259" }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.537723Z", "start_time": "2025-05-06T16:59:01.534894Z" } }, "cell_type": "code", "source": [ "def read_your_dataset(filenames):\n", " data = np.load(filenames[\"data\"])\n", " labels = np.load(filenames[\"labels\"])\n", " ts_ids, signal_ids, timestamps = np.indices(shape)\n", " ts_ids, signal_ids, timestamps = ts_ids.ravel(), signal_ids.ravel(), timestamps.ravel()\n", "\n", " for ts_id, signal_id, timestamp in zip(ts_ids, signal_ids, timestamps):\n", " value = data[ts_id, signal_id, timestamp]\n", " if np.isnan(value):\n", " continue\n", " label = labels[ts_id]\n", " yield dict(\n", " time_series_id=ts_id,\n", " channel_id=signal_id,\n", " timestamp=timestamp,\n", " value=value,\n", " labels=label,\n", " )" ], "id": "daa382c511dbec", "outputs": [], "execution_count": 38 }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.558076Z", "start_time": "2025-05-06T16:59:01.554128Z" } }, "cell_type": "code", "source": [ "from pyrregular.io_utils import read_csv\n", "from pyrregular.reader_interface import ReaderInterface\n", "from pyrregular.accessor import IrregularAccessor\n", "\n", "class YourDataset(ReaderInterface):\n", " @staticmethod\n", " def read_original_version(verbose=False):\n", " return read_csv(\n", " filenames={\n", " \"data\": \"your_more_complex_dataset.npy\",\n", " \"labels\": \"your_more_complex_dataset_labels.npy\",\n", " },\n", " ts_id=\"time_series_id\",\n", " time_id=\"timestamp\",\n", " signal_id=\"channel_id\",\n", " value_id=\"value\",\n", " dims={\n", " \"ts_id\": [\n", " \"labels\"\n", " ], # static variable that depends on the time series id\n", " \"signal_id\": [],\n", " \"time_id\": [],\n", " },\n", " reader_fun=read_your_dataset,\n", " time_index_as_datetime=False,\n", " verbose=verbose,\n", " attrs={\n", " \"authors\": \"Bond, James Bond\", # you can add any attribute you want\n", " }\n", " )" ], "id": "5567545c76edb180", "outputs": [], "execution_count": 39 }, { "metadata": { "ExecuteTime": { "end_time": "2025-05-06T16:59:01.650998Z", "start_time": "2025-05-06T16:59:01.617425Z" } }, "cell_type": "code", "source": [ "da = 
YourDataset.read_original_version(True)\n", "da" ], "id": "fb3a5db18fa0b8c4", "outputs": [ { "data": { "text/plain": [ "Getting dataset metadata: 0it [00:00, ?it/s]" ], "application/vnd.jupyter.widget-view+json": { "version_major": 2, "version_minor": 0, "model_id": "ed2b57b62e054f628a4c34113b4ba6ae" } }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Reading dataset: 0%|          | 0/720 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "
<xarray.DataArray (ts_id: 10, signal_id: 2, time_id: 100)> Size: 23kB\n",
       "<COO: shape=(10, 2, 100), dtype=float64, nnz=720, fill_value=nan>\n",
       "Coordinates:\n",
       "  * time_id    (time_id) int64 800B 0 1 2 3 4 5 6 7 ... 92 93 94 95 96 97 98 99\n",
       "    labels     (ts_id) int64 80B 0 0 0 1 1 1 0 1 1 0\n",
       "  * ts_id      (ts_id) <U21 840B '0' '1' '2' '3' '4' '5' '6' '7' '8' '9'\n",
       "  * signal_id  (signal_id) <U21 168B '0' '1'\n",
       "Attributes:\n",
       "    authors:  Bond, James Bond
" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 40 } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }