Geometric clipping (#93)

wfvining · cwhanse · web-flow · commit 979e114a4955 · 2020-11-18T12:34:05.000-07:00
* Test fixtures for new geometric clipping function

Set up to pass pdc0 for PVWatts inverter model to generate data
with clipping. 1-minute timestamp spacing can be downsampled
for tests at lower frequencies.

* Test that data with/without clipping is correctly identified

* Register the pdc0_inverter pytest.mark

pytest warns on unregistered marks to prevent typos; register this
mark so we can use it safely.
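
A minimal sketch of the registration, assuming the same
`pytest_configure` hook used in conftest.py. The marker description
is written so that adjacent string literals join with a space
between them.

```python
def pytest_configure(config):
    # Register the custom mark so pytest does not warn about it.
    # Adjacent string literals are concatenated with no separator,
    # so each fragment except the last ends with a space.
    config.addinivalue_line(
        "markers",
        "pdc0_inverter: pass inverter DC input limit to fixture "
        "that models AC power using PVWatts"
    )
```

A test can then apply `@pytest.mark.pdc0_inverter(60)` and the fixture
can read the value with `request.node.get_closest_marker`.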

* Parametrize tests by data frequency

Test at 1, 15, 30, and 60 minute frequencies.

* Test that the correct data is flagged as clipped

Also expand parametrization to include 5 minute data.

* Down-sample data with frequency less than 10 minutes

A different method is used to calculate the clipping threshold
when data is down-sampled.
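
A rough sketch of the down-sampling step, assuming 15-minute means as
described in the `geometric` docstring. The helper name
`downsample_if_fast` is hypothetical and is not part of this change;
the '15min' alias is used here because newer pandas releases
deprecate '15T'.

```python
import numpy as np
import pandas as pd


def downsample_if_fast(ac_power, freq_minutes):
    """Return 15-minute means when spacing is under 10 minutes."""
    if freq_minutes < 10:
        # Averaging over 15-minute bins reduces noise in
        # high-frequency data before looking for low-slope periods.
        return ac_power.resample('15min').mean()
    return ac_power


# Thirty 1-minute samples collapse into two 15-minute means.
index = pd.date_range('2020-07-01 12:00', periods=30, freq='min')
power = pd.Series(np.arange(30, dtype=float), index=index)
coarse = downsample_if_fast(power, freq_minutes=1)
```

Data already at 15-minute spacing or coarser is returned unchanged, in
which case the daily min/max of the low-slope periods is used rather
than the mean and standard deviation.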

* Test with "simulated" midday cloudy period

* Use larger default window for tracking systems

* Test passing a larger window results in no clipping detected

* Test tracking=True parameter

* Don't test correctness at 1-minute frequency

Because some not-clipped data is very close to the clipping
threshold at 1-minute timestamp spacing, creating a test that
exactly captures the expected output is unreasonable. We have
tests that ensure clipping is detected at that frequency if
and only if it is present.

* Add features.clipping.geometric to api.rst

* Allow some values that are not clipped for high-frequency data

Values that are very near the clipping level cannot be reliably
distinguished from clipped data.

* Use cythonized kernel functions

Using .transform('max') instead of
.transform(lambda xs: data[xs.index][xs].max()) gives substantial
performance improvements. Some minor additional work is required to
select the correct data before applying .transform().
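
A small sketch of the idea, assuming the selection is done by setting
unselected values to NaN (as `_apply_daily_mask` does). The sample
values below are made up for illustration.

```python
import pandas as pd

index = pd.date_range('2020-07-01', periods=8, freq='360min')
data = pd.Series([1.0, 5.0, 3.0, 2.0, 4.0, 9.0, 6.0, 7.0], index=index)
mask = pd.Series([True, False, True, True, True, False, True, True],
                 index=index)

# Select the data first: values outside the mask become NaN, which
# the built-in 'max' reduction skips.
selected = data.where(mask)
# Passing the string 'max' lets pandas use its compiled reduction
# instead of calling a Python lambda once per day.
daily_max = selected.resample('D').transform('max')
```

Each timestamp receives the maximum of the selected values on its
day, so the unselected 5.0 and 9.0 do not contribute.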

* Add test with irregular and missing data

* Raise a ValueError if ac_power is not sorted

Because we are rolling over the integer indices, not time windows,
the data must be sorted.
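
A minimal sketch of the check, using the same error message as
`geometric`. The helper name `require_sorted` is hypothetical; in the
actual change the check is inline.

```python
import pandas as pd


def require_sorted(ac_power):
    """Raise ValueError unless the index is monotonically increasing."""
    if not ac_power.index.is_monotonic_increasing:
        raise ValueError("Index must be monotonically increasing.")


# The second timestamp is earlier than the first, so a positional
# rolling window would mix samples from the wrong times.
unsorted = pd.Series(
    [1, 2, 3],
    index=pd.DatetimeIndex(
        ['2020-02-01 07:00', '2020-02-01 06:30', '2020-02-01 07:30']
    )
)
try:
    require_sorted(unsorted)
    raised = False
except ValueError:
    raised = True
```

Calling `sort_index()` on the data before passing it in avoids the
error.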

* Adjust min/max threshold to be below/above true min/max

The previous approach was to round to 8 decimal places; however,
this fails when both min and max are rounded up to the same value.
The solution implemented here is slightly more complex, but ensures
that the thresholds are adjusted in the correct direction (maximum
increases if it is less than the true minimum and minimum decreases
if it is greater than the true maximum).
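
A sketch of the adjustment under the assumption that all selected
values on a day are identical and the computed mean lands one
floating point step above them. The helper name `adjust_thresholds`
is hypothetical; the logic mirrors the end of `_threshold_mean`.

```python
import numpy as np
import pandas as pd


def adjust_thresholds(clipped_min, clipped_max, true_min, true_max):
    """Lower the minimum where it exceeds the true maximum and raise
    the maximum where it falls below the true minimum."""
    clipped_min = clipped_min.copy()
    clipped_max = clipped_max.copy()
    min_above_max = clipped_min > true_max
    max_below_min = clipped_max < true_min
    clipped_min[min_above_max] = true_max[min_above_max]
    clipped_max[max_below_min] = true_min[max_below_min]
    return clipped_min, clipped_max


# With zero standard deviation both thresholds equal the mean, here
# assumed to be one representable step above the true value.
true_value = pd.Series([0.1, 0.1])
mean_with_error = pd.Series([np.nextafter(0.1, 1.0)] * 2)
low, high = adjust_thresholds(
    mean_with_error, mean_with_error, true_value, true_value
)
flagged = (true_value >= low) & (true_value <= high)
```

Without the adjustment the minimum threshold would sit above every
data point and nothing would be flagged.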

Co-authored-by: Cliff Hansen <cwhanse@sandia.gov>
diff --git a/docs/api.rst b/docs/api.rst
@@ -188,6 +188,7 @@ Functions for identifying inverter clipping
 
    features.clipping.levels
    features.clipping.threshold
+   features.clipping.geometric
 
 Clearsky
 --------
diff --git a/pvanalytics/features/clipping.py b/pvanalytics/features/clipping.py
@@ -227,3 +227,207 @@ def threshold(ac_power, slope_max=0.0035, power_min=0.75,
         freq=freq
     )
     return ac_power >= threshold
+
+
+def _freq_minutes(index, freq):
+    """Return the frequency in minutes for `freq`. If `freq` is None
+    then use the frequency inferred from `index`."""
+    if freq is None:
+        freq = pd.infer_freq(index)
+    if freq is None:
+        raise ValueError("cannot infer frequency")
+    return util.freq_to_timedelta(freq).seconds / 60
+
+
+def _apply_daily_mask(mask, data, transformation):
+    """Apply `transformation` to the data selected by `mask` on each day.
+
+    Parameters
+    ----------
+    mask : Series
+        Boolean Series with same index as `data`
+    data : Series
+        Series with the data.
+    transformation : str or function
+        Any value that can be passed to ``Series.resample().transform()``.
+
+    Returns
+    -------
+    Series
+        Series with same index as `mask` and values assigned by applying
+        transformation to data in ``data[mask]`` on each day.
+    """
+    data = data.copy()
+    data[~mask] = np.nan
+    return data.resample('D').transform(transformation)
+
+
+def _threshold_mean(mask, data):
+    """Return daily thresholds based on mean and standard deviation.
+
+    Parameters
+    ----------
+    mask : Series
+        Boolean series.
+    data : Series
+        Data with same index as `mask`.
+
+    Returns
+    -------
+    minimum : Series
+        `data` transformed to the mean of ``data[mask]`` minus 2 times
+        the standard deviation of ``data[mask]`` on each day.
+    maximum : Series
+        `data` transformed to the mean of ``data[mask]`` plus 2 times
+        the standard deviation of ``data[mask]`` on each day.
+    """
+    daily_mean = _apply_daily_mask(mask, data, 'mean')
+    daily_std = _apply_daily_mask(mask, data, 'std')
+    daily_clipped_max = daily_mean + 2 * daily_std
+    daily_clipped_min = daily_mean - 2 * daily_std
+    # In cases where the standard deviation is 0 (i.e. all the data is
+    # identical) it is possible for the mean to be above the daily maximum
+    # by a very small amount due to floating point rounding errors. To ensure
+    # that rounding errors do not affect the final outcome we lower the daily
+    # clipping minimum if it is greater than the maximum for that day and
+    # raise the daily clipping maximum if it is less than the minimum for
+    # that day.
+    daily_min, daily_max = _threshold_minmax(mask, data)
+    min_above_max = daily_clipped_min > daily_max
+    max_below_min = daily_clipped_max < daily_min
+    daily_clipped_min[min_above_max] = daily_max[min_above_max]
+    daily_clipped_max[max_below_min] = daily_min[max_below_min]
+    return daily_clipped_min, daily_clipped_max
+
+
+def _threshold_minmax(mask, data):
+    """Return daily thresholds based on min and max.
+
+    Parameters
+    ----------
+    mask : Series
+        Boolean series
+    data : Series
+        Data with same index as `mask`.
+
+    Returns
+    -------
+    minimum : Series
+        `data` transformed to have the minimum value from ``data[mask]``
+        on each day.
+    maximum : Series
+        `data` transformed to have the maximum value from ``data[mask]``
+        on each day.
+    """
+    daily_max = _apply_daily_mask(mask, data, 'max')
+    daily_min = _apply_daily_mask(mask, data, 'min')
+    return daily_min, daily_max
+
+
+def _rolling_low_slope(ac_power, window, slope_max):
+    """Return True for timestamps where the data has slope less
+    than `slope_max` over a rolling window of length `window`."""
+    # Reverse the series to do a forward looking (left-labeled)
+    # rolling max/min.
+    rolling_max = ac_power[::-1].rolling(
+        window=window).max().reindex_like(ac_power)
+    rolling_min = ac_power[::-1].rolling(
+        window=window).min().reindex_like(ac_power)
+    # calculate an upper bound on the derivative
+    derivative_max = ((rolling_max - rolling_min)
+                      / ((rolling_max + rolling_min) / 2) * 100)
+    clipped = derivative_max < slope_max
+    clipped_windows = clipped.copy()
+    # flag all points in a window that has clipping
+    for i in range(0, window):
+        clipped_windows |= clipped.shift(i)
+    return clipped_windows
+
+
+def geometric(ac_power, window=None, slope_max=0.2, freq=None,
+              tracking=False):
+    """Identify clipping based on the shape of the `ac_power`
+    curve on each day.
+
+    Each day is checked for periods where the slope of `ac_power`
+    is small. The power values in these periods are used to calculate
+    a minimum and a maximum clipped power level for that day. Any
+    power values that are within this range are flagged as
+    clipped. The methodology for computing the thresholds varies
+    depending on the frequency of `ac_power`. For high frequency
+    data (less than 10 minute timestamp spacing) the minimum and
+    maximum clipped power are the mean of the low-slope period(s) on
+    that day minus and plus 2 times the standard deviation in the
+    same period(s), respectively. For lower frequency data the absolute
+    minimum and maximum of the low-slope period(s) on each day are used.
+
+    If the frequency of `ac_power` is less than ten minutes, then
+    `ac_power` is down-sampled to 15 minutes and the mean value in
+    each 15-minute period is used to reduce noise inherent in
+    high frequency data.
+
+    Parameters
+    ----------
+    ac_power : Series
+        AC power data.
+    window : int, optional
+        Size of the rolling window used to identify low-slope
+        periods. If not specified and `tracking` is False then
+        `window=3` is used. If not specified and `tracking` is
+        True then `window=5` is used.
+    slope_max : float, default 0.2
+        Maximum difference in maximum and minimum power for a
+        window to be flagged as clipped. Units are percent of
+        average power in the interval.
+    freq : str, optional
+        Frequency of `ac_power`. If not specified then
+        :py:func:`pandas.infer_freq` is used.
+    tracking : bool, default False
+        If True then a larger default `window` is used. If `window`
+        is specified then `tracking` has no effect.
+
+    Returns
+    -------
+    Series
+        Boolean Series with True for values that appear to be clipped.
+
+    Raises
+    ------
+    ValueError
+        If the index of `ac_power` is not sorted.
+
+    Notes
+    -----
+    Based on code from the PVFleets QA project.
+    """
+    if not ac_power.index.is_monotonic_increasing:
+        raise ValueError("Index must be monotonically increasing.")
+    ac_power_original = ac_power.copy()
+    ac_power = ac_power_original
+    try:
+        freq_minutes = _freq_minutes(ac_power.index, freq)
+    except ValueError:
+        raise ValueError("Cannot infer frequency of `ac_power`. "
+                         "Please resample or pass `freq`.")
+    if freq_minutes < 10:
+        ac_power = ac_power.resample('15T').mean()
+    if window is None and tracking and freq_minutes < 30:
+        window = 5
+    else:
+        window = window or 3
+    # remove low power times to eliminate night.
+    daily_min = ac_power.resample('D').transform('max') * 0.1
+    ac_power.loc[ac_power < daily_min] = np.nan
+    clipped = _rolling_low_slope(ac_power, window, slope_max)
+    if not ac_power.index.equals(ac_power_original.index):
+        # data was down-sampled.
+        daily_clipped_min, daily_clipped_max = _threshold_mean(
+            clipped.reindex_like(ac_power_original, method='ffill'),
+            ac_power_original
+        )
+    else:
+        daily_clipped_min, daily_clipped_max = _threshold_minmax(
+            clipped, ac_power_original
+        )
+    return ((ac_power_original >= daily_clipped_min)
+            & (ac_power_original <= daily_clipped_max))
diff --git a/pvanalytics/tests/conftest.py b/pvanalytics/tests/conftest.py
@@ -14,6 +14,9 @@ def pytest_addoption(parser):
 
 def pytest_configure(config):
     config.addinivalue_line("markers", "slow: mark test as slow to run")
+    config.addinivalue_line("markers", "pdc0_inverter: pass inverter "
+                                       "DC input limit to fixture that "
+                                       "models AC power using PVWatts")
 
 
 def pytest_collection_modifyitems(config, items):
diff --git a/pvanalytics/tests/features/test_clipping.py b/pvanalytics/tests/features/test_clipping.py
@@ -3,6 +3,8 @@
 from pandas.util.testing import assert_series_equal
 import numpy as np
 import pandas as pd
+from pvlib import irradiance, temperature, pvsystem, inverter
+from pvlib.temperature import TEMPERATURE_MODEL_PARAMETERS
 from pvanalytics.features import clipping
 
 
@@ -242,3 +244,132 @@ def test_threshold_no_clipping_four_days(quadratic):
     clipped = clipping.threshold(power)
 
     assert not clipped.any()
+
+
+@pytest.fixture(scope='module')
+def july():
+    return pd.date_range(start='7/1/2020', end='8/1/2020', freq='T')
+
+
+@pytest.fixture(scope='module')
+def clearsky_july(july, albuquerque):
+    return albuquerque.get_clearsky(
+        july,
+        model='simplified_solis'
+    )
+
+
+@pytest.fixture(scope='module')
+def solarposition_july(july, albuquerque):
+    return albuquerque.get_solarposition(july)
+
+
+@pytest.fixture
+def power_pvwatts(request, clearsky_july, solarposition_july):
+    pdc0 = 100
+    pdc0_inverter = 110
+    tilt = 30
+    azimuth = 180
+    pdc0_marker = request.node.get_closest_marker("pdc0_inverter")
+    if pdc0_marker is not None:
+        pdc0_inverter = pdc0_marker.args[0]
+    poa = irradiance.get_total_irradiance(
+        tilt, azimuth,
+        solarposition_july['zenith'], solarposition_july['azimuth'],
+        **clearsky_july
+    )
+    cell_temp = temperature.sapm_cell(
+        poa['poa_global'], 25, 0,
+        **TEMPERATURE_MODEL_PARAMETERS['sapm']['open_rack_glass_glass']
+    )
+    dc = pvsystem.pvwatts_dc(poa['poa_global'], cell_temp, pdc0, -0.004)
+    return inverter.pvwatts(dc, pdc0_inverter)
+
+
+@pytest.mark.parametrize('freq', ['T', '5T', '15T', '30T', 'H'])
+def test_geometric_no_clipping(power_pvwatts, freq):
+    clipped = clipping.geometric(power_pvwatts.resample(freq).asfreq())
+    assert not clipped.any()
+
+
+@pytest.mark.pdc0_inverter(60)
+@pytest.mark.parametrize('freq', ['T', '5T', '15T', '30T', 'H'])
+def test_geometric_clipping(power_pvwatts, freq):
+    clipped = clipping.geometric(power_pvwatts.resample(freq).asfreq())
+    assert clipped.any()
+
+
+@pytest.mark.pdc0_inverter(65)
+@pytest.mark.parametrize('freq', ['5T', '15T', '30T', 'H'])
+def test_geometric_clipping_correct(power_pvwatts, freq):
+    power = power_pvwatts.resample(freq).asfreq()
+    clipped = clipping.geometric(power)
+    expected = power == power.max()
+    if freq == '5T':
+        assert np.allclose(power[clipped], power.max(), atol=0.5)
+    else:
+        assert_series_equal(clipped, expected)
+
+
+@pytest.mark.pdc0_inverter(65)
+def test_geometric_clipping_midday_clouds(power_pvwatts):
+    power = power_pvwatts.resample('15T').asfreq()
+    power.loc[power.between_time(
+        start_time='17:30', end_time='19:30',
+        include_start=True, include_end=True
+    ).index] = list(range(30, 39)) * 31
+    clipped = clipping.geometric(power)
+    expected = power == power.max()
+    assert_series_equal(clipped, expected)
+
+
+@pytest.mark.pdc0_inverter(80)
+def test_geometric_clipping_window(power_pvwatts):
+    power = power_pvwatts.resample('15T').asfreq()
+    clipped = clipping.geometric(power)
+    assert clipped.any()
+    clipped_window = clipping.geometric(power, window=24)
+    assert not clipped_window.any()
+
+
+@pytest.mark.pdc0_inverter(89)
+def test_geometric_clipping_tracking(power_pvwatts):
+    power = power_pvwatts.resample('15T').asfreq()
+    clipped = clipping.geometric(power)
+    assert clipped.any()
+    clipped = clipping.geometric(power, tracking=True)
+    assert not clipped.any()
+
+
+@pytest.mark.pdc0_inverter(80)
+def test_geometric_clipping_window_overrides_tracking(power_pvwatts):
+    power = power_pvwatts.resample('15T').asfreq()
+    clipped = clipping.geometric(power, tracking=True)
+    assert clipped.any()
+    clipped_override = clipping.geometric(power, tracking=True, window=24)
+    assert not clipped_override.any()
+
+
+@pytest.mark.parametrize('freq', ['5T', '15T'])
+def test_geometric_clipping_missing_data(freq, power_pvwatts):
+    power = power_pvwatts.resample(freq).asfreq()
+    power.loc[power.between_time('09:00', '10:30').index] = np.nan
+    power.loc[power.between_time('12:15', '12:45').index] = np.nan
+    power.dropna(inplace=True)
+    with pytest.raises(ValueError,
+                       match="Cannot infer frequency of `ac_power`. "
+                             "Please resample or pass `freq`."):
+        clipping.geometric(power)
+    assert not clipping.geometric(power, freq=freq).any()
+
+
+def test_geometric_index_not_sorted():
+    power = pd.Series(
+        [1, 2, 3],
+        index=pd.DatetimeIndex(
+            ['20200201 0700', '20200201 0630', '20200201 0730']
+        )
+    )
+    with pytest.raises(ValueError,
+                       match=r"Index must be monotonically increasing\."):
+        clipping.geometric(power, freq='30T')