Sentinel Hub Statistical API

This notebook shows how to use the Sentinel Hub Statistical API to obtain aggregated statistics of satellite data for areas of interest. For more information about the Statistical API, please check the official service documentation.

Prerequisites

Imports

The tutorial requires the additional packages geopandas, matplotlib, and seaborn, which are not dependencies of sentinelhub-py.
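
If any of them are missing, they can be installed with pip, for example:

pip install geopandas matplotlib seaborn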

[1]:
%matplotlib inline

import json
import datetime as dt
from collections import defaultdict

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sentinelhub import (
    SentinelHubStatistical, DataCollection, CRS, BBox, bbox_to_dimensions,
    Geometry, SHConfig, parse_time, parse_time_interval, SentinelHubStatisticalDownloadClient
)

Credentials

The Statistical API requires a Sentinel Hub account. Please check the configuration instructions on how to set up your Sentinel Hub credentials.

[2]:
from sentinelhub import SHConfig

config = SHConfig()

if not config.sh_client_id or not config.sh_client_secret:
    print("Warning! To use Statistical API, please provide the " \
          "credentials (OAuth client ID and client secret).")

Helper function

A helper function that will be used in this tutorial.

[3]:
def stats_to_df(stats_data):
    """ Transform Statistical API response into a pandas.DataFrame
    """
    df_data = []

    for single_data in stats_data['data']:
        df_entry = {}
        is_valid_entry = True

        df_entry['interval_from'] = parse_time(single_data['interval']['from']).date()
        df_entry['interval_to'] = parse_time(single_data['interval']['to']).date()

        for output_name, output_data in single_data['outputs'].items():
            for band_name, band_values in output_data['bands'].items():

                band_stats = band_values['stats']
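                # an interval where every pixel is no-data carries no statistics, so we drop it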
                if band_stats['sampleCount'] == band_stats['noDataCount']:
                    is_valid_entry = False
                    break

                for stat_name, value in band_stats.items():
                    col_name = f'{output_name}_{band_name}_{stat_name}'
                    if stat_name == 'percentiles':
                        for perc, perc_val in value.items():
                            perc_col_name = f'{col_name}_{perc}'
                            df_entry[perc_col_name] = perc_val
                    else:
                        df_entry[col_name] = value

        if is_valid_entry:
            df_data.append(df_entry)

    return pd.DataFrame(df_data)

Make a Statistical API request

In the Process API tutorial, we saw how to obtain satellite imagery. The Statistical API can be used in much the same way. The main difference is that its results are aggregated statistical values of satellite data instead of entire images. In many use cases, such values are all we need, so the Statistical API lets us avoid downloading and processing large amounts of satellite data.

Let’s take the same bounding box of the Betsiboka Estuary used in the Process API tutorial and make a Statistical API request.

[4]:
betsiboka_bbox = BBox([46.16, -16.15, 46.51, -15.58], CRS.WGS84)

rgb_evalscript = """
//VERSION=3

function setup() {
  return {
    input: [
      {
        bands: [
          "B02",
          "B03",
          "B04",
          "dataMask"
        ]
      }
    ],
    output: [
      {
        id: "rgb",
        bands: ["R", "G", "B"]
      },
      {
        id: "dataMask",
        bands: 1
      }
    ]
  }
}

function evaluatePixel(samples) {
    return {
      rgb: [samples.B04, samples.B03, samples.B02],
      dataMask: [samples.dataMask]
    };
}
"""

rgb_request = SentinelHubStatistical(
    aggregation=SentinelHubStatistical.aggregation(
        evalscript=rgb_evalscript,
        time_interval=('2020-06-07', '2020-06-13'),
        aggregation_interval='P1D',
        size=(631, 1047)
    ),
    input_data=[
        SentinelHubStatistical.input_data(
            DataCollection.SENTINEL2_L1C,
            maxcc=0.8
        )
    ],
    bbox=betsiboka_bbox,
    config=config
)

The following will send the request to the Sentinel Hub service and obtain the results.

[5]:
%%time

rgb_stats = rgb_request.get_data()[0]

rgb_stats
CPU times: user 55.3 ms, sys: 10.3 ms, total: 65.6 ms
Wall time: 20.3 s
[5]:
{'data': [{'interval': {'from': '2020-06-07T00:00:00Z',
    'to': '2020-06-08T00:00:00Z'},
   'outputs': {'rgb': {'bands': {'R': {'stats': {'min': 0.00419999985024333,
        'max': 0.7027999758720398,
        'mean': 0.1124257412122722,
        'stDev': 0.04222600867954154,
        'sampleCount': 660657,
        'noDataCount': 0}},
      'B': {'stats': {'min': 0.054999999701976776,
        'max': 0.5311999917030334,
        'mean': 0.10198085449414412,
        'stDev': 0.01494034416514153,
        'sampleCount': 660657,
        'noDataCount': 0}},
      'G': {'stats': {'min': 0.030400000512599945,
        'max': 0.5981000065803528,
        'mean': 0.09885213961499702,
        'stDev': 0.019804562978698967,
        'sampleCount': 660657,
        'noDataCount': 0}}}}}},
  {'interval': {'from': '2020-06-12T00:00:00Z', 'to': '2020-06-13T00:00:00Z'},
   'outputs': {'rgb': {'bands': {'R': {'stats': {'min': 0.004600000102072954,
        'max': 0.7160000205039978,
        'mean': 0.11546704480268109,
        'stDev': 0.06332157724593372,
        'sampleCount': 660657,
        'noDataCount': 0}},
      'B': {'stats': {'min': 0.05920000001788139,
        'max': 0.5658000111579895,
        'mean': 0.11913311262009575,
        'stDev': 0.04636540384197817,
        'sampleCount': 660657,
        'noDataCount': 0}},
      'G': {'stats': {'min': 0.03779999911785126,
        'max': 0.6126000285148621,
        'mean': 0.11290437308774512,
        'stDev': 0.048338260231641964,
        'sampleCount': 660657,
        'noDataCount': 0}}}}}}],
 'status': 'OK'}

We obtained pixel statistics for each band and for both available acquisition dates.
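
Since the response is a plain dictionary, individual values can be picked out directly. For example, the mean of the R band for the first acquisition (matching the output above):

first_interval = rgb_stats['data'][0]
r_mean = first_interval['outputs']['rgb']['bands']['R']['stats']['mean']
print(first_interval['interval']['from'], r_mean)  # 2020-06-07T00:00:00Z 0.1124...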

Multiple requests for a collection of geometries

The real value of the Statistical API shows in use cases where we have a large collection of small geometries, sparsely distributed over a large geographical area, and we would like to obtain statistical information for each of them.

In this example, we will take 4 small polygons, each marking a different land cover type, and collect NDVI values for each of them.

[6]:
polygons_gdf = gpd.read_file('data/statapi_test.geojson')

polygons_gdf
[6]:
land_type geometry
0 water POLYGON ((411873.909 5126285.627, 412036.018 5...
1 forest POLYGON ((451909.515 5087408.351, 452291.478 5...
2 agricultural POLYGON ((553495.985 5137450.826, 554130.662 5...
3 urban POLYGON ((548010.445 5156156.292, 549867.483 5...

For each polygon, we define a Statistical API request. The requested statistics will be calculated at 10-meter resolution and aggregated per day for each available acquisition. We will also request histogram values.

[7]:
yearly_time_interval = '2020-01-01', '2020-12-31'

ndvi_evalscript = """
//VERSION=3

function setup() {
  return {
    input: [
      {
        bands: [
          "B04",
          "B08",
          "dataMask"
        ]
      }
    ],
    output: [
      {
        id: "ndvi",
        bands: 1
      },
      {
        id: "dataMask",
        bands: 1
      }
    ]
  }
}

function evaluatePixel(samples) {
    return {
      ndvi: [index(samples.B08, samples.B04)],
      dataMask: [samples.dataMask]
    };
}
"""

aggregation = SentinelHubStatistical.aggregation(
    evalscript=ndvi_evalscript,
    time_interval=yearly_time_interval,
    aggregation_interval='P1D',
    resolution=(10, 10)
)

input_data = SentinelHubStatistical.input_data(
    DataCollection.SENTINEL2_L2A
)

histogram_calculations = {
    "ndvi": {
        "histograms": {
            "default": {
                "nBins": 20,
                "lowEdge": -1.0,
                "highEdge": 1.0
            }
        }
    }
}

ndvi_requests = []

for geo_shape in polygons_gdf.geometry.values:
    request = SentinelHubStatistical(
        aggregation=aggregation,
        input_data=[input_data],
        geometry=Geometry(geo_shape, crs=CRS(polygons_gdf.crs)),
        calculations=histogram_calculations,
        config=config
    )
    ndvi_requests.append(request)

Instead of triggering the download for each request separately, we can pass all request objects together to a download client, which will execute them in parallel using multiple threads. This makes the download process faster.

[8]:
%%time

download_requests = [ndvi_request.download_list[0] for ndvi_request in ndvi_requests]

client = SentinelHubStatisticalDownloadClient(config=config)

ndvi_stats = client.download(download_requests)

len(ndvi_stats)
CPU times: user 116 ms, sys: 818 µs, total: 117 ms
Wall time: 11.1 s
[8]:
4

Let’s convert this data into tabular form by transforming it into a pandas DataFrame.

[9]:
ndvi_dfs = [stats_to_df(polygon_stats) for polygon_stats in ndvi_stats]

for df, land_type in zip(ndvi_dfs, polygons_gdf['land_type'].values):
    df['land_type'] = land_type

ndvi_df = pd.concat(ndvi_dfs)

ndvi_df
[9]:
interval_from interval_to ndvi_B0_min ndvi_B0_max ndvi_B0_mean ndvi_B0_stDev ndvi_B0_sampleCount ndvi_B0_noDataCount land_type
0 2020-01-01 2020-01-02 -0.043547 0.078686 0.035990 0.014587 9174 4322 water
1 2020-01-04 2020-01-05 0.013191 0.115481 0.059611 0.015508 9174 4322 water
2 2020-01-06 2020-01-07 -1.000000 0.812500 -0.203617 0.222675 9174 4322 water
3 2020-01-09 2020-01-10 -0.964912 0.962963 -0.029157 0.281115 9174 4322 water
4 2020-01-11 2020-01-12 -1.000000 0.947368 -0.032308 0.239918 9174 4322 water
... ... ... ... ... ... ... ... ... ...
140 2020-12-18 2020-12-19 -0.087464 0.067416 -0.006859 0.016857 43263 11755 urban
141 2020-12-21 2020-12-22 -0.072165 0.042030 -0.011940 0.009287 43263 11755 urban
142 2020-12-23 2020-12-24 -0.068307 0.718450 0.224862 0.123213 43263 11755 urban
143 2020-12-26 2020-12-27 -0.175758 0.994366 0.334077 0.200383 43263 11755 urban
144 2020-12-28 2020-12-29 -0.047584 0.093174 0.034037 0.016839 43263 11755 urban

505 rows × 9 columns

The following plot shows time series of mean NDVI values, buffered by one standard deviation.

[10]:
fig, ax = plt.subplots(figsize=(15, 8))

for idx, land_type in enumerate(polygons_gdf['land_type'].values):
    series = ndvi_df[ndvi_df['land_type'] == land_type]

    series.plot(
        ax=ax,
        x='interval_from',
        y='ndvi_B0_mean',
        color=f'C{idx}',
        label=land_type
    )

    ax.fill_between(
        series.interval_from.values,
        series['ndvi_B0_mean'] - series['ndvi_B0_stDev'],
        series['ndvi_B0_mean'] + series['ndvi_B0_stDev'],
        color=f'C{idx}',
        alpha=0.3
    );
../_images/examples_statistical_request_20_0.png

Let’s also plot histograms for a selected timestamp.

[11]:
TIMESTAMP_INDEX = 2

plot_data = []
timestamp = None
for idx, stats in enumerate(ndvi_stats):
    bins = stats['data'][TIMESTAMP_INDEX]['outputs']['ndvi']['bands']['B0']['histogram']['bins']
    timestamp = stats['data'][TIMESTAMP_INDEX]['interval']['from'].split('T')[0]

    counts = [value['count'] for value in bins]
    total_counts = sum(counts)
    counts = [round(100 * count / total_counts) for count in counts]

    bin_size = bins[1]['lowEdge'] - bins[0]['lowEdge']
    splits = [value['lowEdge'] + bin_size / 2 for value in bins]

    data = []
    for count, split in zip(counts, splits):
        data.extend([split] * count)
    plot_data.append(np.array(data))


fig, ax = plt.subplots(figsize=(12, 8))
ax = sns.violinplot(
    data=plot_data,
    ax=ax
)
ax.set(xticklabels=polygons_gdf['land_type'].values)
plt.xlabel('Land types', fontsize=15);
plt.ylabel(f'NDVI for {timestamp}', fontsize=15);
../_images/examples_statistical_request_22_0.png

Reduce service processing costs

In large-scale processing, the number of processing units spent on Statistical API requests becomes important. We can decrease this number by writing an evalscript that outputs integer values instead of floats.

In this example, we will download statistics for all Sentinel-2 L2A bands, cloud masks and probabilities, and a few remote sensing indices. Such a collection of values would typically be used as an input to a machine learning model.

The following evalscript will:

  • request band values in digital numbers (DN) instead of reflectances,
  • use the toUINT function to convert index values into integers that fit into SampleType.UINT16 (a round-trip sketch follows the evalscript below),
  • mask pixels for which the cloud mask CLM indicates clouds; such pixels will not be included in the statistics.
[12]:
with open('./data/statapi_evalscript.js', 'r') as fp:
    features_evalscript = fp.read()

print(features_evalscript)
//VERSION=3

function setup() {
  return {
    input: [{
      bands: ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12", "CLM", "CLP", "dataMask"],
      units: "DN"
    }],
    output: [
      {
        id: "bands",
        bands: ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
        sampleType: "UINT16"
      },
      {
        id: "masks",
        bands: ["CLM"],
        sampleType: "UINT16"
      },
      {
        id: "indices",
        bands: ["NDVI", "NDVI_RE1", "NBSI", "CLP"],
        sampleType: "UINT16"
      },
      {
        id: "dataMask",
        bands: 1
      }]
  }
}

function evaluatePixel(samples) {
    // Normalised Difference Vegetation Index and variation
    let NDVI = index(samples.B08, samples.B04);
    let NDVI_RE1 = index(samples.B08, samples.B05);

    // Bare Soil Index
    let NBSI = index((samples.B11 + samples.B04), (samples.B08 + samples.B02));

    // cloud probability normalized to interval [0, 1]
    let CLP = samples.CLP / 255.0;

    // masking cloudy pixels
    let combinedMask = samples.dataMask;
    if (samples.CLM > 0) {
        combinedMask = 0;
    }

    const f = 5000;
    return {
        bands: [samples.B01, samples.B02, samples.B03, samples.B04, samples.B05, samples.B06,
                samples.B07, samples.B08, samples.B8A, samples.B09, samples.B11, samples.B12],
        masks: [samples.CLM],
        indices: [toUINT(NDVI, f), toUINT(NDVI_RE1, f), toUINT(NBSI, f), toUINT(CLP, f)],
        dataMask: [combinedMask]
    };
}

function toUINT(product, constant){
  // Clamp the output to [-1, 10] and convert it to a UINT16
  // value that can be converted back to float later.
  if (product < -1) {
    product = -1;
  } else if (product > 10) {
    product = 10;
  }
  return Math.round(product * constant) + constant;
}
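
To make the encoding concrete, here is a small Python mirror of toUINT together with its inverse, which we will apply when rescaling the statistics below (the names to_uint16 and from_uint16 are ours, chosen for illustration):

def to_uint16(value, constant=5000):
    """Mirror of the evalscript's toUINT: clamp to [-1, 10], scale and shift."""
    value = max(-1.0, min(10.0, value))
    return round(value * constant) + constant

def from_uint16(encoded, constant=5000):
    """Inverse transformation: shift back and rescale to the original range."""
    return (encoded - constant) / constant

assert to_uint16(0.5) == 7500
assert from_uint16(to_uint16(0.5)) == 0.5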

Let’s create requests for all 4 polygons. Additionally, we will request statistics for the 5th, 50th, and 95th percentiles.

[13]:
aggregation = SentinelHubStatistical.aggregation(
    evalscript=features_evalscript,
    time_interval=yearly_time_interval,
    aggregation_interval='P1D',
    resolution=(10, 10)
)

calculations = {
    "default": {
        "statistics": {
            "default": {
                "percentiles": {
                    "k": [5, 50, 95]
                }
            }
        }
    }
}

features_requests = []
for geo_shape in polygons_gdf.geometry.values:
    request = SentinelHubStatistical(
        aggregation=aggregation,
        input_data=[
            SentinelHubStatistical.input_data(DataCollection.SENTINEL2_L2A)
        ],
        geometry=Geometry(geo_shape, crs=CRS(polygons_gdf.crs)),
        calculations=calculations,
        config=config
    )

    features_requests.append(request)
[14]:
%%time

download_requests = [request.download_list[0] for request in features_requests]

client = SentinelHubStatisticalDownloadClient(config=config)

features_stats = client.download(download_requests)

len(features_stats)
CPU times: user 135 ms, sys: 18.3 ms, total: 153 ms
Wall time: 14.2 s
[14]:
4

Let’s convert the service response into a dataframe. Compared to the previous example, this dataframe has more columns, as we requested more types of features, and fewer rows, because for some acquisitions all pixels contained clouds and were masked.

[15]:
features_dfs = [stats_to_df(polygon_stats) for polygon_stats in features_stats]

for df, land_type in zip(features_dfs, polygons_gdf['land_type'].values):
    df['land_type'] = land_type

features_df = pd.concat(features_dfs)

features_df
[15]:
interval_from interval_to indices_CLP_min indices_CLP_max indices_CLP_mean indices_CLP_stDev indices_CLP_sampleCount indices_CLP_noDataCount indices_CLP_percentiles_50.0 indices_CLP_percentiles_5.0 ... bands_B09_min bands_B09_max bands_B09_mean bands_B09_stDev bands_B09_sampleCount bands_B09_noDataCount bands_B09_percentiles_50.0 bands_B09_percentiles_5.0 bands_B09_percentiles_95.0 land_type
0 2020-01-06 2020-01-07 5000.0 5098.0 5034.679308 26.986245 9174 4322 5039.0 5000.0 ... 0.0 93.0 3.355317 9.493755 9174 4322 1.0 0.0 11.0 water
1 2020-01-09 2020-01-10 5000.0 5039.0 5012.309151 12.455728 9174 4322 5020.0 5000.0 ... 1.0 249.0 50.922300 47.405314 9174 4322 31.0 1.0 128.0 water
2 2020-01-11 2020-01-12 5000.0 5255.0 5020.793899 22.882725 9174 4322 5020.0 5000.0 ... 0.0 303.0 23.561624 40.569825 9174 4322 10.0 0.0 93.0 water
3 2020-01-14 2020-01-15 5000.0 7882.0 5371.652165 526.251686 9174 4324 5078.0 5000.0 ... 132.0 742.0 289.403093 159.372937 9174 4324 204.0 143.0 593.0 water
4 2020-01-21 2020-01-22 5000.0 7941.0 5663.055029 695.239685 9174 4322 5510.0 5020.0 ... 52.0 675.0 262.527617 128.020716 9174 4322 246.0 78.0 525.0 water
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
76 2020-11-21 2020-11-22 5020.0 7647.0 5938.670242 482.312106 43263 11755 5824.0 5333.0 ... 739.0 4380.0 1626.769773 536.858212 43263 11755 1503.0 1038.0 2664.0 urban
77 2020-11-23 2020-11-24 5196.0 9647.0 6502.110467 836.759383 43263 20478 6294.0 5471.0 ... 991.0 4670.0 1919.495019 490.949295 43263 20478 1819.0 1405.0 2812.0 urban
78 2020-12-01 2020-12-02 5118.0 8471.0 6512.164604 671.922946 43263 24436 6510.0 5412.0 ... 748.0 3954.0 1600.711903 485.870607 43263 24436 1488.0 1064.0 2575.0 urban
79 2020-12-13 2020-12-14 5039.0 8059.0 6558.605708 704.985325 43263 42317 6314.0 5039.0 ... 709.0 1872.0 1147.692389 298.336497 43263 42317 1132.0 785.0 1773.0 urban
80 2020-12-26 2020-12-27 5020.0 8294.0 6000.444427 533.363801 43263 11755 5922.0 5255.0 ... 249.0 4054.0 1371.798654 489.142637 43263 11755 1247.0 880.0 2201.0 urban

253 rows × 156 columns

Next, we can rescale the values back to their original ranges. The following will:

  • convert the band statistics from digital numbers to reflectances,
  • apply the inverse of the toUINT transformation to the index statistics.
[16]:
BANDS = DataCollection.SENTINEL2_L2A.bands
INDICES = ['NDVI', 'NDVI_RE1', 'NBSI', 'CLP']
STATISTICAL_QUANTITIES = ['mean', 'min', 'max', 'stDev', 'percentiles_5.0',
                          'percentiles_50.0', 'percentiles_95.0']

for band in BANDS:
    for stat in STATISTICAL_QUANTITIES:
        column_name = f'bands_{band}_{stat}'
        column = features_df[column_name]

        column = column / 10000.

        features_df[column_name] = column

for index in INDICES:
    for stat in STATISTICAL_QUANTITIES:
        column_name = f'indices_{index}_{stat}'
        column = features_df[column_name]

        if stat == 'stDev':
            column = column / 5000.
        else:
            column = (column - 5000.) / 5000.

        features_df[column_name] = column

features_df
[16]:
interval_from interval_to indices_CLP_min indices_CLP_max indices_CLP_mean indices_CLP_stDev indices_CLP_sampleCount indices_CLP_noDataCount indices_CLP_percentiles_50.0 indices_CLP_percentiles_5.0 ... bands_B09_min bands_B09_max bands_B09_mean bands_B09_stDev bands_B09_sampleCount bands_B09_noDataCount bands_B09_percentiles_50.0 bands_B09_percentiles_5.0 bands_B09_percentiles_95.0 land_type
0 2020-01-06 2020-01-07 0.0000 0.0196 0.006936 0.005397 9174 4322 0.0078 0.0000 ... 0.0000 0.0093 0.000336 0.000949 9174 4322 0.0001 0.0000 0.0011 water
1 2020-01-09 2020-01-10 0.0000 0.0078 0.002462 0.002491 9174 4322 0.0040 0.0000 ... 0.0001 0.0249 0.005092 0.004741 9174 4322 0.0031 0.0001 0.0128 water
2 2020-01-11 2020-01-12 0.0000 0.0510 0.004159 0.004577 9174 4322 0.0040 0.0000 ... 0.0000 0.0303 0.002356 0.004057 9174 4322 0.0010 0.0000 0.0093 water
3 2020-01-14 2020-01-15 0.0000 0.5764 0.074330 0.105250 9174 4324 0.0156 0.0000 ... 0.0132 0.0742 0.028940 0.015937 9174 4324 0.0204 0.0143 0.0593 water
4 2020-01-21 2020-01-22 0.0000 0.5882 0.132611 0.139048 9174 4322 0.1020 0.0040 ... 0.0052 0.0675 0.026253 0.012802 9174 4322 0.0246 0.0078 0.0525 water
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
76 2020-11-21 2020-11-22 0.0040 0.5294 0.187734 0.096462 43263 11755 0.1648 0.0666 ... 0.0739 0.4380 0.162677 0.053686 43263 11755 0.1503 0.1038 0.2664 urban
77 2020-11-23 2020-11-24 0.0392 0.9294 0.300422 0.167352 43263 20478 0.2588 0.0942 ... 0.0991 0.4670 0.191950 0.049095 43263 20478 0.1819 0.1405 0.2812 urban
78 2020-12-01 2020-12-02 0.0236 0.6942 0.302433 0.134385 43263 24436 0.3020 0.0824 ... 0.0748 0.3954 0.160071 0.048587 43263 24436 0.1488 0.1064 0.2575 urban
79 2020-12-13 2020-12-14 0.0078 0.6118 0.311721 0.140997 43263 42317 0.2628 0.0078 ... 0.0709 0.1872 0.114769 0.029834 43263 42317 0.1132 0.0785 0.1773 urban
80 2020-12-26 2020-12-27 0.0040 0.6588 0.200089 0.106673 43263 11755 0.1844 0.0510 ... 0.0249 0.4054 0.137180 0.048914 43263 11755 0.1247 0.0880 0.2201 urban

253 rows × 156 columns

Let’s plot the NDVI time series. The number of irregularities in the series has decreased because cloudy pixels were masked out.

[17]:
fig, ax = plt.subplots(figsize=(15, 8))

for idx, land_type in enumerate(polygons_gdf['land_type'].values):
    series = features_df[features_df['land_type'] == land_type]

    series.plot(
        ax=ax,
        x='interval_from',
        y='indices_NDVI_mean',
        color=f'C{idx}',
        label=land_type
    )

    ax.fill_between(
        series.interval_from.values,
        series['indices_NDVI_mean'] - series['indices_NDVI_stDev'],
        series['indices_NDVI_mean'] + series['indices_NDVI_stDev'],
        color=f'C{idx}',
        alpha=0.3
    );
../_images/examples_statistical_request_33_0.png