Sentinel Hub Statistical API

This notebook shows how to use the Sentinel Hub Statistical API to obtain aggregated statistics of satellite data for areas of interest. For more information about the Statistical API, please check the official service documentation.

Prerequisites

Imports

The tutorial requires the additional packages geopandas, matplotlib, and seaborn, which are not dependencies of sentinelhub-py.
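
If any of them are missing, they can be installed with pip, for example:

pip install geopandas matplotlib seaborn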

[1]:
%matplotlib inline

import json
import datetime as dt
from collections import defaultdict

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sentinelhub import (
    SentinelHubStatistical, DataCollection, CRS, BBox, bbox_to_dimensions,
    Geometry, SHConfig, parse_time, parse_time_interval, SentinelHubStatisticalDownloadClient
)

Credentials

The Statistical API requires a Sentinel Hub account. Please check the configuration instructions on how to set up your Sentinel Hub credentials.

[2]:
from sentinelhub import SHConfig

config = SHConfig()

if not config.sh_client_id or not config.sh_client_secret:
    print("Warning! To use Statistical API, please provide the " \
          "credentials (OAuth client ID and client secret).")

Helper function

A helper function that will be used in this tutorial.

[3]:
def stats_to_df(stats_data):
    """ Transform Statistical API response into a pandas.DataFrame
    """
    df_data = []

    for single_data in stats_data['data']:
        df_entry = {}
        is_valid_entry = True

        df_entry['interval_from'] = parse_time(single_data['interval']['from']).date()
        df_entry['interval_to'] = parse_time(single_data['interval']['to']).date()

        for output_name, output_data in single_data['outputs'].items():
            for band_name, band_values in output_data['bands'].items():

                band_stats = band_values['stats']
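                # an interval where every pixel is no-data carries no statistics, so we drop it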
                if band_stats['sampleCount'] == band_stats['noDataCount']:
                    is_valid_entry = False
                    break

                for stat_name, value in band_stats.items():
                    col_name = f'{output_name}_{band_name}_{stat_name}'
                    if stat_name == 'percentiles':
                        for perc, perc_val in value.items():
                            perc_col_name = f'{col_name}_{perc}'
                            df_entry[perc_col_name] = perc_val
                    else:
                        df_entry[col_name] = value

        if is_valid_entry:
            df_data.append(df_entry)

    return pd.DataFrame(df_data)

Make a Statistical API request

In the Process API tutorial, we saw how to obtain satellite imagery. The Statistical API can be used in much the same way. The main difference is that its results are aggregated statistical values of satellite data instead of entire images. In many use cases, such values are all we need, so the Statistical API lets us avoid downloading and processing large amounts of satellite data.

Let’s take the same bounding box of the Betsiboka Estuary used in the Process API tutorial and make a Statistical API request.

[4]:
betsiboka_bbox = BBox([46.16, -16.15, 46.51, -15.58], CRS.WGS84)

rgb_evalscript = """
//VERSION=3

function setup() {
  return {
    input: [
      {
        bands: [
          "B02",
          "B03",
          "B04",
          "dataMask"
        ]
      }
    ],
    output: [
      {
        id: "rgb",
        bands: ["R", "G", "B"]
      },
      {
        id: "dataMask",
        bands: 1
      }
    ]
  }
}

function evaluatePixel(samples) {
    return {
      rgb: [samples.B04, samples.B03, samples.B02],
      dataMask: [samples.dataMask]
    };
}
"""

rgb_request = SentinelHubStatistical(
    aggregation=SentinelHubStatistical.aggregation(
        evalscript=rgb_evalscript,
        time_interval=('2020-06-07', '2020-06-13'),
        aggregation_interval='P1D',
        size=(631, 1047)
    ),
    input_data=[
        SentinelHubStatistical.input_data(
            DataCollection.SENTINEL2_L1C,
            maxcc=0.8
        )
    ],
    bbox=betsiboka_bbox,
    config=config
)

The following will send the request to the Sentinel Hub service and obtain the results.

[5]:
%%time

rgb_stats = rgb_request.get_data()[0]

rgb_stats
CPU times: user 55.3 ms, sys: 10.3 ms, total: 65.6 ms
Wall time: 20.3 s
[5]:
{'data': [{'interval': {'from': '2020-06-07T00:00:00Z',
    'to': '2020-06-08T00:00:00Z'},
   'outputs': {'rgb': {'bands': {'R': {'stats': {'min': 0.00419999985024333,
        'max': 0.7027999758720398,
        'mean': 0.1124257412122722,
        'stDev': 0.04222600867954154,
        'sampleCount': 660657,
        'noDataCount': 0}},
      'B': {'stats': {'min': 0.054999999701976776,
        'max': 0.5311999917030334,
        'mean': 0.10198085449414412,
        'stDev': 0.01494034416514153,
        'sampleCount': 660657,
        'noDataCount': 0}},
      'G': {'stats': {'min': 0.030400000512599945,
        'max': 0.5981000065803528,
        'mean': 0.09885213961499702,
        'stDev': 0.019804562978698967,
        'sampleCount': 660657,
        'noDataCount': 0}}}}}},
  {'interval': {'from': '2020-06-12T00:00:00Z', 'to': '2020-06-13T00:00:00Z'},
   'outputs': {'rgb': {'bands': {'R': {'stats': {'min': 0.004600000102072954,
        'max': 0.7160000205039978,
        'mean': 0.11546704480268109,
        'stDev': 0.06332157724593372,
        'sampleCount': 660657,
        'noDataCount': 0}},
      'B': {'stats': {'min': 0.05920000001788139,
        'max': 0.5658000111579895,
        'mean': 0.11913311262009575,
        'stDev': 0.04636540384197817,
        'sampleCount': 660657,
        'noDataCount': 0}},
      'G': {'stats': {'min': 0.03779999911785126,
        'max': 0.6126000285148621,
        'mean': 0.11290437308774512,
        'stDev': 0.048338260231641964,
        'sampleCount': 660657,
        'noDataCount': 0}}}}}}],
 'status': 'OK'}

We obtained pixel statistics for each band and for both available acquisition dates.
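
Since the response is a plain dictionary, individual values can be picked out directly. For example, the mean of the R band for the first acquisition (matching the output above):

first_interval = rgb_stats['data'][0]
r_mean = first_interval['outputs']['rgb']['bands']['R']['stats']['mean']
print(first_interval['interval']['from'], r_mean)  # 2020-06-07T00:00:00Z 0.1124...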

Multiple requests for a collection of geometries

The real value of the Statistical API shows in use cases where we have a large collection of small geometries, sparsely distributed over a large geographical area, and we would like to obtain statistical information for each of them.

In this example, we will take 4 small polygons, each marking a different land cover type, and collect NDVI values for each of them.

[6]:
polygons_gdf = gpd.read_file('data/statapi_test.geojson')

polygons_gdf
[6]:
land_type geometry
0 water POLYGON ((411873.909 5126285.627, 412036.018 5...
1 forest POLYGON ((451909.515 5087408.351, 452291.478 5...
2 agricultural POLYGON ((553495.985 5137450.826, 554130.662 5...
3 urban POLYGON ((548010.445 5156156.292, 549867.483 5...

For each polygon, we define a Statistical API request. The requested statistics will be calculated at 10-meter resolution and aggregated per day for each available acquisition. We will also request histogram values.

[7]:
yearly_time_interval = '2020-01-01', '2020-12-31'

ndvi_evalscript = """
//VERSION=3

function setup() {
  return {
    input: [
      {
        bands: [
          "B04",
          "B08",
          "dataMask"
        ]
      }
    ],
    output: [
      {
        id: "ndvi",
        bands: 1
      },
      {
        id: "dataMask",
        bands: 1
      }
    ]
  }
}

function evaluatePixel(samples) {
    return {
      ndvi: [index(samples.B08, samples.B04)],
      dataMask: [samples.dataMask]
    };
}
"""

aggregation = SentinelHubStatistical.aggregation(
    evalscript=ndvi_evalscript,
    time_interval=yearly_time_interval,
    aggregation_interval='P1D',
    resolution=(10, 10)
)

input_data = SentinelHubStatistical.input_data(
    DataCollection.SENTINEL2_L2A
)

histogram_calculations = {
    "ndvi": {
        "histograms": {
            "default": {
                "nBins": 20,
                "lowEdge": -1.0,
                "highEdge": 1.0
            }
        }
    }
}

ndvi_requests = []

for geo_shape in polygons_gdf.geometry.values:
    request = SentinelHubStatistical(
        aggregation=aggregation,
        input_data=[input_data],
        geometry=Geometry(geo_shape, crs=CRS(polygons_gdf.crs)),
        calculations=histogram_calculations,
        config=config
    )
    ndvi_requests.append(request)

Instead of triggering the download for each request separately, we can pass all request objects together to a download client, which will execute them in parallel using multiple threads. This makes the download process faster.

[8]:
%%time

download_requests = [ndvi_request.download_list[0] for ndvi_request in ndvi_requests]

client = SentinelHubStatisticalDownloadClient(config=config)

ndvi_stats = client.download(download_requests)

len(ndvi_stats)
CPU times: user 116 ms, sys: 818 µs, total: 117 ms
Wall time: 11.1 s
[8]:
4

Let’s convert this data into tabular form by transforming it into a pandas DataFrame.

[9]:
ndvi_dfs = [stats_to_df(polygon_stats) for polygon_stats in ndvi_stats]

for df, land_type in zip(ndvi_dfs, polygons_gdf['land_type'].values):
    df['land_type'] = land_type

ndvi_df = pd.concat(ndvi_dfs)

ndvi_df
[9]:
interval_from interval_to ndvi_B0_min ndvi_B0_max ndvi_B0_mean ndvi_B0_stDev ndvi_B0_sampleCount ndvi_B0_noDataCount land_type
0 2020-01-01 2020-01-02 -0.043547 0.078686 0.035990 0.014587 9174 4322 water
1 2020-01-04 2020-01-05 0.013191 0.115481 0.059611 0.015508 9174 4322 water
2 2020-01-06 2020-01-07 -1.000000 0.812500 -0.203617 0.222675 9174 4322 water
3 2020-01-09 2020-01-10 -0.964912 0.962963 -0.029157 0.281115 9174 4322 water
4 2020-01-11 2020-01-12 -1.000000 0.947368 -0.032308 0.239918 9174 4322 water
... ... ... ... ... ... ... ... ... ...
140 2020-12-18 2020-12-19 -0.087464 0.067416 -0.006859 0.016857 43263 11755 urban
141 2020-12-21 2020-12-22 -0.072165 0.042030 -0.011940 0.009287 43263 11755 urban
142 2020-12-23 2020-12-24 -0.068307 0.718450 0.224862 0.123213 43263 11755 urban
143 2020-12-26 2020-12-27 -0.175758 0.994366 0.334077 0.200383 43263 11755 urban
144 2020-12-28 2020-12-29 -0.047584 0.093174 0.034037 0.016839 43263 11755 urban

505 rows × 9 columns

The following plot shows time series of mean NDVI values, buffered by one standard deviation.

[10]:
fig, ax = plt.subplots(figsize=(15, 8))

for idx, land_type in enumerate(polygons_gdf['land_type'].values):
    series = ndvi_df[ndvi_df['land_type'] == land_type]

    series.plot(
        ax=ax,
        x='interval_from',
        y='ndvi_B0_mean',
        color=f'C{idx}',
        label=land_type
    )

    ax.fill_between(
        series.interval_from.values,
        series['ndvi_B0_mean'] - series['ndvi_B0_stDev'],
        series['ndvi_B0_mean'] + series['ndvi_B0_stDev'],
        color=f'C{idx}',
        alpha=0.3
    );
../_images/examples_statistical_request_20_0.png

Let’s also plot histograms for a selected timestamp.

[11]:
TIMESTAMP_INDEX = 2

plot_data = []
timestamp = None
for idx, stats in enumerate(ndvi_stats):
    bins = stats['data'][TIMESTAMP_INDEX]['outputs']['ndvi']['bands']['B0']['histogram']['bins']
    timestamp = stats['data'][TIMESTAMP_INDEX]['interval']['from'].split('T')[0]

    counts = [value['count'] for value in bins]
    total_counts = sum(counts)
    counts = [round(100 * count / total_counts) for count in counts]

    bin_size = bins[1]['lowEdge'] - bins[0]['lowEdge']
    splits = [value['lowEdge'] + bin_size / 2 for value in bins]

    data = []
    for count, split in zip(counts, splits):
        data.extend([split] * count)
    plot_data.append(np.array(data))


fig, ax = plt.subplots(figsize=(12, 8))
ax = sns.violinplot(
    data=plot_data,
    ax=ax
)
ax.set(xticklabels=polygons_gdf['land_type'].values)
plt.xlabel('Land types', fontsize=15);
plt.ylabel(f'NDVI for {timestamp}', fontsize=15);
../_images/examples_statistical_request_22_0.png

Reduce service processing costs

In large-scale processing, the number of processing units spent on Statistical API requests becomes important. We can decrease this number by writing an evalscript that outputs integer values instead of floats.

In this example, we will download statistics for all Sentinel-2 L2A bands, cloud masks and probabilities, and a few remote sensing indices. Such a collection of values would typically be used as an input to a machine learning model.

The following evalscript will:

  • request band values in digital numbers (DN) instead of reflectances,
  • use the toUINT function to convert index values into integers that fit into SampleType.UINT16 (a round-trip sketch follows the evalscript below),
  • mask pixels for which the cloud mask CLM indicates clouds; such pixels will not be included in the statistics.
[12]:
with open('./data/statapi_evalscript.js', 'r') as fp:
    features_evalscript = fp.read()

print(features_evalscript)
//VERSION=3

function setup() {
  return {
    input: [{
      bands: ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12", "CLM", "CLP", "dataMask"],
      units: "DN"
    }],
    output: [
      {
        id: "bands",
        bands: ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
        sampleType: "UINT16"
      },
      {
        id: "masks",
        bands: ["CLM"],
        sampleType: "UINT16"
      },
      {
        id: "indices",
        bands: ["NDVI", "NDVI_RE1", "NBSI", "CLP"],
        sampleType: "UINT16"
      },
      {
        id: "dataMask",
        bands: 1
      }]
  }
}

function evaluatePixel(samples) {
    // Normalised Difference Vegetation Index and variation
    let NDVI = index(samples.B08, samples.B04);
    let NDVI_RE1 = index(samples.B08, samples.B05);

    // Bare Soil Index
    let NBSI = index((samples.B11 + samples.B04), (samples.B08 + samples.B02));

    // cloud probability normalized to interval [0, 1]
    let CLP = samples.CLP / 255.0;

    // masking cloudy pixels
    let combinedMask = samples.dataMask;
    if (samples.CLM > 0) {
        combinedMask = 0;
    }

    const f = 5000;
    return {
        bands: [samples.B01, samples.B02, samples.B03, samples.B04, samples.B05, samples.B06,
                samples.B07, samples.B08, samples.B8A, samples.B09, samples.B11, samples.B12],
        masks: [samples.CLM],
        indices: [toUINT(NDVI, f), toUINT(NDVI_RE1, f), toUINT(NBSI, f), toUINT(CLP, f)],
        dataMask: [combinedMask]
    };
}

function toUINT(product, constant){
  // Clamp the output to [-1, 10] and convert it to a UINT16
  // value that can be converted back to float later.
  if (product < -1) {
    product = -1;
  } else if (product > 10) {
    product = 10;
  }
  return Math.round(product * constant) + constant;
}
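
To make the encoding concrete, here is a small Python mirror of toUINT together with its inverse, which we will apply when rescaling the statistics below (the names to_uint16 and from_uint16 are ours, chosen for illustration):

def to_uint16(value, constant=5000):
    """Mirror of the evalscript's toUINT: clamp to [-1, 10], scale and shift."""
    value = max(-1.0, min(10.0, value))
    return round(value * constant) + constant

def from_uint16(encoded, constant=5000):
    """Inverse transformation: shift back and rescale to the original range."""
    return (encoded - constant) / constant

assert to_uint16(0.5) == 7500
assert from_uint16(to_uint16(0.5)) == 0.5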

Let’s create requests for all 4 polygons. Additionally, we will request statistics for the 5th, 50th, and 95th percentiles.

[13]:
aggregation = SentinelHubStatistical.aggregation(
    evalscript=features_evalscript,
    time_interval=yearly_time_interval,
    aggregation_interval='P1D',
    resolution=(10, 10)
)

calculations = {
    "default": {
        "statistics": {
            "default": {
                "percentiles": {
                    "k": [5, 50, 95]
                }
            }
        }
    }
}

features_requests = []
for geo_shape in polygons_gdf.geometry.values:
    request = SentinelHubStatistical(
        aggregation=aggregation,
        input_data=[
            SentinelHubStatistical.input_data(DataCollection.SENTINEL2_L2A)
        ],
        geometry=Geometry(geo_shape, crs=CRS(polygons_gdf.crs)),
        calculations=calculations,
        config=config
    )

    features_requests.append(request)
[14]:
%%time

download_requests = [request.download_list[0] for request in features_requests]

client = SentinelHubStatisticalDownloadClient(config=config)

features_stats = client.download(download_requests)

len(features_stats)
CPU times: user 135 ms, sys: 18.3 ms, total: 153 ms
Wall time: 14.2 s
[14]:
4

Let’s convert the service response into a dataframe. Compared to the previous example, this dataframe has more columns, as we requested more types of features, and fewer rows, because for some acquisitions all pixels contained clouds and were masked.

[15]:
features_dfs = [stats_to_df(polygon_stats) for polygon_stats in features_stats]

for df, land_type in zip(features_dfs, polygons_gdf['land_type'].values):
    df['land_type'] = land_type

features_df = pd.concat(features_dfs)

features_df
[15]:
interval_from interval_to indices_CLP_min indices_CLP_max indices_CLP_mean indices_CLP_stDev indices_CLP_sampleCount indices_CLP_noDataCount indices_CLP_percentiles_50.0 indices_CLP_percentiles_5.0 ... bands_B09_min bands_B09_max bands_B09_mean bands_B09_stDev bands_B09_sampleCount bands_B09_noDataCount bands_B09_percentiles_50.0 bands_B09_percentiles_5.0 bands_B09_percentiles_95.0 land_type
0 2020-01-06 2020-01-07 5000.0 5098.0 5034.679308 26.986245 9174 4322 5039.0 5000.0 ... 0.0 93.0 3.355317 9.493755 9174 4322 1.0 0.0 11.0 water
1 2020-01-09 2020-01-10 5000.0 5039.0 5012.309151 12.455728 9174 4322 5020.0 5000.0 ... 1.0 249.0 50.922300 47.405314 9174 4322 31.0 1.0 128.0 water
2 2020-01-11 2020-01-12 5000.0 5255.0 5020.793899 22.882725 9174 4322 5020.0 5000.0 ... 0.0 303.0 23.561624 40.569825 9174 4322 10.0 0.0 93.0 water
3 2020-01-14 2020-01-15 5000.0 7882.0 5371.652165 526.251686 9174 4324 5078.0 5000.0 ... 132.0 742.0 289.403093 159.372937 9174 4324 204.0 143.0 593.0 water
4 2020-01-21 2020-01-22 5000.0 7941.0 5663.055029 695.239685 9174 4322 5510.0 5020.0 ... 52.0 675.0 262.527617 128.020716 9174 4322 246.0 78.0 525.0 water
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
76 2020-11-21 2020-11-22 5020.0 7647.0 5938.670242 482.312106 43263 11755 5824.0 5333.0 ... 739.0 4380.0 1626.769773 536.858212 43263 11755 1503.0 1038.0 2664.0 urban
77 2020-11-23 2020-11-24 5196.0 9647.0 6502.110467 836.759383 43263 20478 6294.0 5471.0 ... 991.0 4670.0 1919.495019 490.949295 43263 20478 1819.0 1405.0 2812.0 urban
78 2020-12-01 2020-12-02 5118.0 8471.0 6512.164604 671.922946 43263 24436 6510.0 5412.0 ... 748.0 3954.0 1600.711903 485.870607 43263 24436 1488.0 1064.0 2575.0 urban
79 2020-12-13 2020-12-14 5039.0 8059.0 6558.605708 704.985325 43263 42317 6314.0 5039.0 ... 709.0 1872.0 1147.692389 298.336497 43263 42317 1132.0 785.0 1773.0 urban
80 2020-12-26 2020-12-27 5020.0 8294.0 6000.444427 533.363801 43263 11755 5922.0 5255.0 ... 249.0 4054.0 1371.798654 489.142637 43263 11755 1247.0 880.0 2201.0 urban

253 rows × 156 columns

Next, we can rescale the values back to their original ranges. The following will:

  • convert the band statistics from digital numbers to reflectances,
  • apply the inverse of the toUINT transformation to the index statistics.
[16]:
BANDS = DataCollection.SENTINEL2_L2A.bands
INDICES = ['NDVI', 'NDVI_RE1', 'NBSI', 'CLP']
STATISTICAL_QUANTITIES = ['mean', 'min', 'max', 'stDev', 'percentiles_5.0',
                          'percentiles_50.0', 'percentiles_95.0']

for band in BANDS:
    for stat in STATISTICAL_QUANTITIES:
        column_name = f'bands_{band}_{stat}'
        column = features_df[column_name]

        column = column / 10000.

        features_df[column_name] = column

for index in INDICES:
    for stat in STATISTICAL_QUANTITIES:
        column_name = f'indices_{index}_{stat}'
        column = features_df[column_name]

        if stat == 'stDev':
            column = column / 5000.
        else:
            column = (column - 5000.) / 5000.

        features_df[column_name] = column

features_df
[16]:
interval_from interval_to indices_CLP_min indices_CLP_max indices_CLP_mean indices_CLP_stDev indices_CLP_sampleCount indices_CLP_noDataCount indices_CLP_percentiles_50.0 indices_CLP_percentiles_5.0 ... bands_B09_min bands_B09_max bands_B09_mean bands_B09_stDev bands_B09_sampleCount bands_B09_noDataCount bands_B09_percentiles_50.0 bands_B09_percentiles_5.0 bands_B09_percentiles_95.0 land_type
0 2020-01-06 2020-01-07 0.0000 0.0196 0.006936 0.005397 9174 4322 0.0078 0.0000 ... 0.0000 0.0093 0.000336 0.000949 9174 4322 0.0001 0.0000 0.0011 water
1 2020-01-09 2020-01-10 0.0000 0.0078 0.002462 0.002491 9174 4322 0.0040 0.0000 ... 0.0001 0.0249 0.005092 0.004741 9174 4322 0.0031 0.0001 0.0128 water
2 2020-01-11 2020-01-12 0.0000 0.0510 0.004159 0.004577 9174 4322 0.0040 0.0000 ... 0.0000 0.0303 0.002356 0.004057 9174 4322 0.0010 0.0000 0.0093 water
3 2020-01-14 2020-01-15 0.0000 0.5764 0.074330 0.105250 9174 4324 0.0156 0.0000 ... 0.0132 0.0742 0.028940 0.015937 9174 4324 0.0204 0.0143 0.0593 water
4 2020-01-21 2020-01-22 0.0000 0.5882 0.132611 0.139048 9174 4322 0.1020 0.0040 ... 0.0052 0.0675 0.026253 0.012802 9174 4322 0.0246 0.0078 0.0525 water
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
76 2020-11-21 2020-11-22 0.0040 0.5294 0.187734 0.096462 43263 11755 0.1648 0.0666 ... 0.0739 0.4380 0.162677 0.053686 43263 11755 0.1503 0.1038 0.2664 urban
77 2020-11-23 2020-11-24 0.0392 0.9294 0.300422 0.167352 43263 20478 0.2588 0.0942 ... 0.0991 0.4670 0.191950 0.049095 43263 20478 0.1819 0.1405 0.2812 urban
78 2020-12-01 2020-12-02 0.0236 0.6942 0.302433 0.134385 43263 24436 0.3020 0.0824 ... 0.0748 0.3954 0.160071 0.048587 43263 24436 0.1488 0.1064 0.2575 urban
79 2020-12-13 2020-12-14 0.0078 0.6118 0.311721 0.140997 43263 42317 0.2628 0.0078 ... 0.0709 0.1872 0.114769 0.029834 43263 42317 0.1132 0.0785 0.1773 urban
80 2020-12-26 2020-12-27 0.0040 0.6588 0.200089 0.106673 43263 11755 0.1844 0.0510 ... 0.0249 0.4054 0.137180 0.048914 43263 11755 0.1247 0.0880 0.2201 urban

253 rows × 156 columns

Let’s plot the NDVI time series. The number of irregularities in the series has decreased because cloudy pixels were masked out.

[17]:
fig, ax = plt.subplots(figsize=(15, 8))

for idx, land_type in enumerate(polygons_gdf['land_type'].values):
    series = features_df[features_df['land_type'] == land_type]

    series.plot(
        ax=ax,
        x='interval_from',
        y='indices_NDVI_mean',
        color=f'C{idx}',
        label=land_type
    )

    ax.fill_between(
        series.interval_from.values,
        series['indices_NDVI_mean'] - series['indices_NDVI_stDev'],
        series['indices_NDVI_mean'] + series['indices_NDVI_stDev'],
        color=f'C{idx}',
        alpha=0.3
    );
../_images/examples_statistical_request_33_0.png