Data "Aggregation" Algorithms
"Back-end" components are responsible for retrieving data from the physical devices. These components also manipulate the data so that, irrespective of the format and specifics of the "raw" data read from the devices, the data stored in the "readout" tables in the main site database is consistent. For example, an actual device may report energy in kWh; the back-end converts these values so that the numbers in the "readout" tables represent Joules. The back-end may also adjust the timestamp of the data so that every row in the readout tables carries a timestamp that falls on a regular interval: a reading actually taken at 14:00:24.0212 is stored in the readout tables with a timestamp of 14:00:00. This makes subsequent calculations easier to run and the data easier to graph.
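The two normalizations described above can be sketched as follows. This is an illustration only, not the real back-end code: the function names are made up, and the 15-minute interval is an assumption (the actual readout interval may differ per device).

```python
from datetime import datetime

def to_joules(kwh: float) -> float:
    """Convert an energy reading from kWh (as reported by a device) to Joules."""
    return kwh * 3_600_000  # 1 kWh = 3.6e6 J

def snap_timestamp(ts: datetime, interval_minutes: int = 15) -> datetime:
    """Round a raw reading timestamp down to the nearest regular interval."""
    snapped_minute = ts.minute - (ts.minute % interval_minutes)
    return ts.replace(minute=snapped_minute, second=0, microsecond=0)

raw = datetime(2023, 5, 1, 14, 0, 24, 21200)   # 14:00:24.0212
print(snap_timestamp(raw))   # 2023-05-01 14:00:00
print(to_joules(1.5))        # 5400000.0
```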
The data loaded by the backend into the readout tables is then "aggregated" by algorithms (stored routines) in the site MySQL databases. We've internally referred to the process of data-processing done mainly within the site MySQL databases as "data aggregation." A more appropriate term would probably have been "data pre-processing" since aggregation is only a part of the transformation done by the site to ensure that the readout data is in a suitable format for consumption by the website.
Essentially, the following calculations/transformations are done:
- Device-specific data is "aggregated" up to hourly, daily, monthly values
- Device-specific data is combined to form "system-level" readings, so for example, a given system may have a reading at 14:00 which includes the data from devices measuring momentary power, cumulative energy, temperature, irradiance, etc.
- "System-level" readings for subsystems (systems with parent systems) are aggregated together to obtain "aggregator-system" readings. For example, if a site has 3 subsystems, then the energy readings for a given moment are added together to obtain the energy reading for the whole site at that moment in time.
- Certain values are calculated from the different types of readings, such as: estimated energy, effective availability, weather-corrected predicted power.
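The "aggregator-system" roll-up in the third bullet can be sketched like this: subsystem readings taken at the same moment are summed to produce the parent-site reading. The data shapes and names below are assumptions for illustration, not the real schema.

```python
from collections import defaultdict
from datetime import datetime

def rollup_energy(subsystem_readings):
    """subsystem_readings: list of (subsystem_id, timestamp, energy_joules).
    Returns {timestamp: total_energy_joules} for the parent (aggregator) system."""
    totals = defaultdict(float)
    for _sub_id, ts, energy in subsystem_readings:
        totals[ts] += energy
    return dict(totals)

# Three subsystems reporting at 14:00; the site-level reading is their sum.
readings = [
    ("sub1", datetime(2023, 5, 1, 14, 0), 1_000_000.0),
    ("sub2", datetime(2023, 5, 1, 14, 0), 2_000_000.0),
    ("sub3", datetime(2023, 5, 1, 14, 0), 500_000.0),
]
print(rollup_energy(readings))  # total at 14:00 is 3500000.0
```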
Manual Data Reaggregation
Occasionally it may become necessary to manually "re-aggregate" our data for a given device/system. Some scenarios which might lead to this:
- Readout data is updated without subsequently notifying the aggregation algorithm of this fact
- Some parameters of the site were input incorrectly, such as inverter count etc.
- Previous irradiance data is attached to a site manually (for example a virtual irradiance device)
In such cases the "raw" data retrieved from the devices is correct in the master (MySQL) database, but calculated and aggregated values are not. Such values include:
- Calculated values such as predicted energy, performance ratio, effective availability
- Daily or monthly aggregated values (which appear in reports and on various parts of the site)
- "Site Aggregator" system data, in other words, data stored for "Virtual Master Meter" type devices
To manually trigger a full "reaggregation" of all data for a system, log into the website as Demo or Zoltan, open the site details page for the given energy site and go to Readings → Diagnostics.
Select the system to re-aggregate and then use the appropriate button to initiate the data re-aggregation:
The process may take several minutes to complete. Once done, you should see "success: true":
Automatic Data Integrity Checks and Reaggregation
The aggregation algorithm runs semi-continuously and attempts to keep aggregated data up-to-date. It checks whether any new records have been added to the database since the last time it aggregated data for the given device. It can also be manually notified of any "back-loading" events where for some reason we updated past values in the database (that have already been aggregated).
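The incremental behaviour described above can be sketched as a per-device high-water mark: the routine remembers the last record it processed and only aggregates rows added since, while a "back-load" notification rewinds the mark so past rows get re-processed. All names here are illustrative; the real routines live in MySQL, not Python.

```python
last_processed = {}  # device_id -> highest readout row id already aggregated

def new_rows_for(device_id, readouts):
    """readouts: list of (row_id, value) ordered by row_id.
    Returns only the rows not yet aggregated and advances the mark."""
    mark = last_processed.get(device_id, 0)
    fresh = [(rid, v) for rid, v in readouts if rid > mark]
    if fresh:
        last_processed[device_id] = fresh[-1][0]
    return fresh

def notify_backload(device_id, from_row_id):
    """After a back-load event, rewind the mark so updated past rows
    are picked up on the next aggregation pass."""
    current = last_processed.get(device_id, 0)
    last_processed[device_id] = min(current, from_row_id - 1)
```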
However, it is still possible for aggregate data to be incorrect: for example, if previously aggregated readout data is manually altered, or rewritten by the back-end without notifying the aggregation algorithms of the "back-load" event. There may also be undiscovered bugs in the aggregation algorithm leading to incorrect aggregate values. To mitigate such issues, one routine (sp_identify_full_reaggregate_candidates) runs once every night and attempts to identify incorrectly aggregated values or missing aggregate data. It then flags the associated devices to be completely re-processed by the aggregation routines.
Currently this routine checks for the following issues; it could be improved upon and further checks could be added.
The first round of checks is fairly simple and as such completes fairly quickly:
- First device_readout_daily does not match date of first readout: we have readout data in DR or DER, but values are missing from DRD completely, or the first day we have values for is after the day of the first energy readout. We also check whether we have data in DPR but no aggregated power values for that day in DRD.
- First system_production does not match first readout: same as the previous check (First device_readout_daily does not match date of first readout), except that instead of checking DRD, we are checking SPD.
- First system_consumption does not match first readout: same as above, except that we are checking SCD for PRIMARY SITE LOAD type devices and associated systems.
- First system_net does not match first readout: same as above, except that we are checking SND for PRIMARY GRID NET type devices and associated systems.
Abbreviations used in these checks:
- DR: device_readout
- DER: device_energy_readout
- DRD: device_readout_daily
- DPR: device_power_readout
- SPD: system_production_daily
- SCD: system_consumption_daily
- SND: system_net_daily
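The first-round checks boil down to comparing the earliest day present in the raw readout tables against the earliest day present in the corresponding daily aggregate table. A minimal sketch, assuming the per-table day sets have already been fetched (in production this is a MySQL stored routine, not Python):

```python
from datetime import date

def first_daily_mismatch(readout_days, daily_days):
    """readout_days: set of dates present in the raw readout table (DR/DER).
    daily_days: set of dates present in the daily aggregate table (DRD/SPD/...).
    Returns True if the device should be flagged for full re-aggregation."""
    if not readout_days:
        return False   # no raw data, nothing to aggregate
    if not daily_days:
        return True    # daily aggregates are missing completely
    # Flag if the first aggregated day comes after the first raw readout day.
    return min(daily_days) > min(readout_days)

print(first_daily_mismatch({date(2023, 1, 1)}, {date(2023, 1, 5)}))  # True
print(first_daily_mismatch({date(2023, 1, 1)}, {date(2023, 1, 1)}))  # False
```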
A second round of tests loops over any devices not identified in the first round. These tests take a long time to run so they are not set to run automatically, however, parameters can be set in the routines to force the checks to run. Here we loop through each device and first identify if there are any "gaps" in the daily aggregated values:
- device_energy_readout (or device_readout) exists but device_readout_daily does not
- device_energy_readout (or device_readout) exists but system_production_daily does not
- device_energy_readout (or device_readout) exists but system_consumption_daily does not
- device_energy_readout (or device_readout) exists but system_net_daily does not
- device_celltemp_readout exists but device_celltemp_readout_daily does not
- device_irradiance_readout exists but device_irradiance_readout_daily does not
- device_power_readout exists but device_power_readout_daily does not
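The gap scan above can be sketched for a single device as a set difference: days that have raw readouts but no row in the corresponding *_daily table. The set-based representation is an illustration only; the real check is a stored routine querying the tables listed above.

```python
from datetime import date

def daily_gaps(raw_days, daily_days):
    """Days present in the raw readout table (e.g. device_energy_readout)
    but missing from the corresponding daily table (e.g. device_readout_daily)."""
    return sorted(raw_days - daily_days)

raw = {date(2023, 5, d) for d in (1, 2, 3, 4)}
daily = {date(2023, 5, 1), date(2023, 5, 3)}
print(daily_gaps(raw, daily))  # [datetime.date(2023, 5, 2), datetime.date(2023, 5, 4)]
```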