API Reference
Rain gauge matching
create_output_dataframes(matches)
Converts a list of match class objects into output dataframes, breaking the nested station class objects out into those dataframes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| matches | list | list of match class objects | required |
Returns:
| Type | Description |
|---|---|
| csv | pandas.DataFrame saved as csv for each type of match available (accepted, rank-rejected, auto-rejected) |
generate_manual_station_matching_notebook(output_dir)
Generate manual station matching ipynb from template (either allowing for backups or not).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | str | Path to outputs | required |
generate_manual_station_matching_script(output_dir, matching_script)
Generate manual station matching script from template (either allowing for backups or not).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | str | Path to outputs | required |
| matching_script | | Which template script to copy across, i.e. with or without backups | required |
run_matching_algorithm(left_df, right_df, save_outputs_to_csv=False, output_dir='outputs', save_manual_matching_script=False, save_manual_matching_notebook=False, allow_backups=False, overwrite_existing=False)
Wrapper function to run matching algorithm for two sets of stations / gauges
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| left_df | DataFrame | contains rows of stations with names and co-ordinates | required |
| right_df | DataFrame | contains rows of stations with names and co-ordinates | required |
| save_outputs_to_csv | bool | Whether to save the outputs to csv | False |
| output_dir | str | Path to outputs | 'outputs' |
| save_manual_matching_script | bool | Whether to generate the manual matching script | False |
| save_manual_matching_notebook | bool | Whether to generate the manual matching notebook | False |
| allow_backups | bool | Whether the generated manual station matching script or notebook should allow for backups | False |
| overwrite_existing | bool | Whether to overwrite existing data, scripts and/or notebooks under output_dir | False |
Returns:
| Type | Description |
|---|---|
| list | contains Match class objects for pairs which returned a match |
| DataFrame | dataframe with each row containing an automatically accepted match for two stations |
| DataFrame | dataframe with each row containing a rank-rejected match for two stations (it scored worse than an accepted match for the left-hand station) |
| DataFrame | dataframe with each row containing an automatically rejected match for two stations (a match was detected, but its score was worse than the threshold, so it is subject to manual review) |
Rain gauge comparison
row_from_comparison(comparison)
Generate dictionary from comparison object that will form row of output dataframe
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| comparison | RainGauge_Comparison | comparison containing attributes to be stored in dictionary | required |
Returns:
| Type | Description |
|---|---|
| dict | dictionary containing information gathered from class attributes |
run_comparison_algorithm(gauge_pair_metadata, left_hand_file_path, right_hand_file_path, datetime_format)
Wrapper function to run comparison on timeseries at two gauges and store output
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gauge_pairs | | each sub-list contains a pair of gauge-ids as strings | required |
| left_hand_file_path | str | location of the left-hand gauge timeseries file | required |
| right_hand_file_path | str | location of the right-hand gauge timeseries file | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | dataframe with each row as a summary comparison of gauge timeseries |
Rain gauge matching classes
Comparison
Bases: object
An object comprised of two RainGauges with attributes and methods for outlining basic statistical information about the gauge timeseries
__init__(primary_gauge, secondary_gauge)
Attributes:
| Name | Type | Description |
|---|---|---|
| primary_gauge | RainGauge_Comparison | gauge from the primary network |
| secondary_gauge | RainGauge_Comparison | gauge from the secondary network |
| raw_timeseries | DataFrame | contains all timesteps of timeseries data for the overlapping period of both gauges' start and end dates |
| timeseries | DataFrame | contains non-NaN timeseries data for the overlapping period of both gauges' start and end dates |
| non_zero_timeseries | DataFrame | contains non-zero timeseries data for the overlapping period of both gauges' start and end dates |
| overlap_start_date | DateTime | first timestep of overlap between self.primary_gauge.timeseries and self.secondary_gauge.timeseries |
| overlap_end_date | DateTime | final timestep of overlap between self.primary_gauge.timeseries and self.secondary_gauge.timeseries |
| overlap_length | TimeDelta | length of overlap from first shared timestep to last shared timestep |
| overlap_timesteps | int | count of timesteps in overlapping period |
| good_timesteps | int | count of timesteps where both gauges record non-NaN values |
| nan_timesteps | int | count of timesteps where one or both gauges record a NaN value |
| sum_of_good_timesteps | TimeDelta | sum of good timesteps shared by both timeseries |
| identical_rows | int | count of timesteps with identical measured parameter values or where both are NaN |
| identical_non_nan_rows | int | count of timesteps with identical measured parameter values |
| identical_rows_percentage | str | identical timesteps / total timesteps, formatted as a percentage |
| identical_non_nan_rows_percentage | str | identical non-NaN timesteps / total non-NaN timesteps, formatted as a percentage |
| primary_accumulation | float | sum of measured parameter at primary gauge across entire overlap period |
| secondary_accumulation | float | sum of measured parameter at secondary gauge across entire overlap period |
| accumulation_difference | float | absolute difference between primary and secondary accumulation |
| accumulation_difference_percentage | str | (primary accumulation / secondary accumulation) - 1, formatted as a percentage; -100% when nothing is recorded at the primary gauge and +100% when nothing is recorded at the secondary gauge |
| r_squared | float | r-squared value for overlapping period of two gauges |
| spcc | float | Spearman's correlation coefficient for overlapping period of two gauges |
| non_zero_r_sqaured | float | r-squared value for non-zero timesteps during overlapping period of two gauges |
| non_zero_spcc | float | Spearman's correlation coefficient for non-zero timesteps during overlapping period of two gauges |
get_accumulation_information()
Calculate summary statistics for accumulation at pair of stations / gauges
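As a rough illustration of the accumulation attributes listed for this class, the sketch below sums two short series and derives the difference and the percentage difference. The function name `accumulation_summary`, the `+.1%` string format and the exact handling of a zero secondary total are assumptions made for this example, not the library's implementation.

```python
def accumulation_summary(primary_values, secondary_values):
    """Summarise accumulation at two gauges over their shared period."""
    primary_total = sum(primary_values)
    secondary_total = sum(secondary_values)
    if secondary_total == 0:
        # Documented convention: +100% when nothing was recorded at the secondary gauge.
        ratio = 1.0
    else:
        # (primary / secondary) - 1; equals -1 (-100%) when the primary total is zero.
        ratio = primary_total / secondary_total - 1
    return {
        "primary_accumulation": primary_total,
        "secondary_accumulation": secondary_total,
        "accumulation_difference": abs(primary_total - secondary_total),
        # Assumed formatting: signed percentage with one decimal place.
        "accumulation_difference_percentage": f"{ratio:+.1%}",
    }


summary = accumulation_summary([2.0, 0.5, 0.0], [1.0, 1.0, 0.0])
print(summary["accumulation_difference_percentage"])  # +25.0%
```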
get_overlap()
Identifies overlapping period between start and end dates of stations / gauges within pair
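One plausible way to find such an overlap is to take the later of the two start dates and the earlier of the two end dates. The sketch below assumes that approach; the helper name `overlap_period` and the choice to return `None` when the periods do not intersect are hypothetical and not taken from the library.

```python
from datetime import datetime, timedelta


def overlap_period(primary_start, primary_end, secondary_start, secondary_end):
    """Return (start, end, length) of the shared period, or None if there is none."""
    start = max(primary_start, secondary_start)  # later of the two start dates
    end = min(primary_end, secondary_end)        # earlier of the two end dates
    if start > end:
        return None
    return start, end, end - start


overlap = overlap_period(
    datetime(2020, 1, 1), datetime(2020, 12, 31),  # primary gauge record
    datetime(2020, 6, 1), datetime(2021, 6, 1),    # secondary gauge record
)
print(overlap[2].days)  # 213
```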
get_row_information()
Calculate summary statistics for timestep similarity at pair of stations / gauges
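The distinction between `identical_rows` and `identical_non_nan_rows` can be sketched as follows. This is a minimal example on plain Python lists rather than dataframes, and the helper name `identical_row_counts` is hypothetical.

```python
import math


def identical_row_counts(primary_values, secondary_values):
    """Return (identical_rows, identical_non_nan_rows) for two aligned series."""
    identical_non_nan = 0
    both_nan = 0
    for p, s in zip(primary_values, secondary_values):
        if math.isnan(p) and math.isnan(s):
            both_nan += 1           # counts towards identical_rows only
        elif p == s:                # False whenever exactly one value is NaN
            identical_non_nan += 1
    return identical_non_nan + both_nan, identical_non_nan


nan = float("nan")
rows = identical_row_counts([0.0, 1.2, nan, nan, 3.0], [0.0, 1.4, nan, 2.0, 3.0])
print(rows)  # (3, 2)
```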
get_statistical_information()
Calculate summary statistics for correlation between a pair of stations / gauges
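For readers unfamiliar with the two statistics, the sketch below computes them with the standard library only. It assumes that r-squared is the square of the Pearson correlation coefficient and that Spearman's coefficient is the Pearson correlation of average ranks; the library may compute these differently (for example via pandas or SciPy), and all function names here are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient as the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


primary = [1.0, 2.0, 3.0, 4.0]
secondary = [1.0, 4.0, 9.0, 16.0]  # monotonic but not linear
r_squared = pearson_r(primary, secondary) ** 2
spcc = spearman(primary, secondary)
```

In this example `spcc` is 1 (up to floating-point error) because the relationship is perfectly monotonic, while `r_squared` is slightly below 1 because it is not linear.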
get_timeseries()
Gets relevant timeseries (and metadata) using overlapping period identified for both stations / gauges
get_timestep_information()
Identify number of 'good' timesteps shared by a pair of matched stations / gauges
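The counts involved can be illustrated on two aligned lists of values. This is a sketch under the assumption that a "good" timestep is one where neither gauge reports NaN, as the attribute descriptions suggest; the helper name `timestep_counts` is hypothetical.

```python
import math


def timestep_counts(primary_values, secondary_values):
    """Count overlapping, good and NaN timesteps for two aligned series."""
    pairs = list(zip(primary_values, secondary_values))
    # A timestep is "good" only if neither gauge reports NaN.
    good = sum(1 for p, s in pairs if not (math.isnan(p) or math.isnan(s)))
    return {
        "overlap_timesteps": len(pairs),
        "good_timesteps": good,
        "nan_timesteps": len(pairs) - good,
    }


nan = float("nan")
counts = timestep_counts([0.0, 1.2, nan, nan], [0.0, nan, nan, 2.0])
print(counts)  # {'overlap_timesteps': 4, 'good_timesteps': 1, 'nan_timesteps': 3}
```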
prepare_comparison()
Run comparison functions
Match
Bases: object
An object comprised of two RainGauges with attributes and methods for defining how well their metadata matches
__init__(match_type, station_left, station_right, distance_score, distance_metres, string_score, common_substrings, common_banned_substrings=None, match_score=None)
get_banned_common_strings()
Check if this is redundant
set_auto_rejected()
Determines if scores meet criteria for an auto-rejected match
set_match_score()
Calculates and sets match score from product of distance and string score
Returns:
| Type | Description |
|---|---|
| int | score in [0, 1, 2, 3, 4, 6, 8, 1000, 2000, 3000, 4000, 6000, 8000] |
RainGauge
Bases: object
The most basic gauge object with a name, id, co-ordinates and a source
__init__(id, name, source='Unspecified', easting=np.nan, northing=np.nan)
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | common name of the station e.g., blue-mountain station |
| id | str | reference id of the station e.g., ABC001 |
| source | str | source of data e.g., random_API |
| easting | float | EPSG:27700 Easting value |
| northing | float | EPSG:27700 Northing value |
Functions:
| Name | Description |
|---|---|
# TODO: Add class method to create raingauges |
|
get_coordinates()
Combines self.easting and self.northing into a geometry object
Returns:
| Type | Description |
|---|---|
| Point | coordinates in EPSG:27700 (British National Grid) |
RainGauge_Comparison
Bases: RainGauge
A version of the RainGauge with a basic timeseries and datetime metadata for more detailed comparison with another gauge
__init__(*, folder_path, datetime_format, **kwargs)
Attributes:
| Name | Type | Description |
|---|---|---|
| folder_path | str | folder where the timeseries file for this gauge is stored; the file should be a csv named after the gauge id or name, e.g., ABC001.csv or blue-mountain station.csv |
| datetime_format | | datetime format used in gauge timeseries files |
| timeseries | DataFrame | timeseries with data, generated from csv at location "{self.folder_path}/{self.id}.csv" |
| start_date | DateTime | first timestep in self.timeseries |
| end_date | DateTime | last timestep in self.timeseries |
get_coordinates()
Combines self.easting and self.northing into a geometry object
Returns:
| Type | Description |
|---|---|
| Point | coordinates in EPSG:27700 (British National Grid) |
get_dates()
Extracts first and last timestep from timeseries dataframe
Returns:
| Type | Description |
|---|---|
| DateTime | first timestep in timeseries |
| DateTime | final timestep in timeseries |
get_timeseries()
Extracts the timeseries from the csv file and checks that the file contains correctly named columns
Returns:
| Type | Description |
|---|---|
| DataFrame | timeseries containing datetime and measured parameter information |
prepare_gauge()
Runs functions to prepare gauge for timeseries comparison
RainGauge_Matching
Bases: RainGauge
A version of the RainGauge with attributes and methods for matching metadata between gauges
__init__(*, banned_strings=None, **kwargs)
Attributes:
| Name | Type | Description |
|---|---|---|
| banned_strings | list | Begins empty; calculated later based on the frequency of sub-string occurrence across all gauges |
get_all_substrings()
Generates all alphanumeric substrings of a string
Returns:
| Type | Description |
|---|---|
| set | unique sub-strings |
get_allowable_substrings()
Generates all allowable (not in self.banned_strings) alphanumeric substrings of a string
Returns:
| Type | Description |
|---|---|
| set | unique sub-strings that are not banned from the matching process |
get_common_substrings(other, mode=None)
Generates all common substrings between two station naming strings
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mode | str | toggle for whether to exclude banned strings ('all' to ignore bans) | None |
Returns:
| Type | Description |
|---|---|
| set | unique common sub-strings |
get_coordinates()
Combines self.easting and self.northing into a geometry object
Returns:
| Type | Description |
|---|---|
| Point | coordinates in EPSG:27700 (British National Grid) |
get_distance(other)
Calculates Euclidean distance between two sets of coordinates
Returns:
| Type | Description |
|---|---|
| float | distance rounded to the nearest integer |
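Because EPSG:27700 coordinates are expressed in metres, a Euclidean distance between two easting/northing pairs is also in metres. The sketch below shows the calculation with the standard library; the function name `euclidean_distance` and the example coordinates are made up for illustration, and the library itself may work on geometry objects instead.

```python
import math


def euclidean_distance(easting_a, northing_a, easting_b, northing_b):
    """Straight-line distance between two points, rounded to the nearest integer."""
    distance = math.hypot(easting_b - easting_a, northing_b - northing_a)
    # Returned as a float to mirror the documented return type.
    return float(round(distance))


print(euclidean_distance(530000.0, 180000.0, 530300.0, 180400.0))  # 500.0
```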
get_distance_score(other)
Calculates distance score based on distance between two sets of coordinates
Returns:
| Type | Description |
|---|---|
| int | distance score in [-1, 0, 1, 2, 3, 999] |
get_match(other)
Generates a Match object if the scoring criteria are met for a pair of stations / gauges
Returns:
| Type | Description |
|---|---|
| Match | object containing gauges and calculated scores |
get_string_score(other)
Calculates string score based on commonality of sub-strings (left hand station has priority for counting unique sub-strings)
Returns:
| Type | Description |
|---|---|
| int | score based on string / sub-string commonality (-1 = identical basic string, n = number of sub-strings of left not in right, 999 = no commonality) |
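The scoring rule in the table above can be sketched as follows. Several assumptions are made here that the reference does not confirm: "sub-strings" are treated as lower-cased alphanumeric tokens, the "basic string" is the name with case and non-alphanumeric characters removed, and banned strings are simply dropped before comparison. The helper names are hypothetical.

```python
import re


def tokens(name):
    """Lower-cased alphanumeric tokens of a station name (assumed meaning of sub-strings)."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def basic_string(name):
    """Name with case and non-alphanumeric characters removed (assumed definition)."""
    return "".join(re.findall(r"[a-z0-9]+", name.lower()))


def string_score(left_name, right_name, banned=()):
    if basic_string(left_name) == basic_string(right_name):
        return -1                      # identical basic string
    left = tokens(left_name) - set(banned)
    right = tokens(right_name) - set(banned)
    if not left & right:
        return 999                     # no commonality
    return len(left - right)           # sub-strings of left not in right


print(string_score("Blue Mountain", "blue-mountain"))        # -1
print(string_score("Blue Mountain North", "Blue Mountain"))  # 1
print(string_score("Blue Mountain", "Red Valley"))           # 999
```

Note that the score is asymmetric: only tokens unique to the left-hand name are counted, which matches the statement that the left-hand station has priority.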
set_banned_strings(banned_strings)
Sets banned_strings attribute of a station
convert_to_pandas_datetime(df, col_to_convert, datetime_format)
Convert a designated column of a pandas dataframe to datetime format
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| col_to_convert | str | name of column to be converted | required |
| datetime_format | | date format present in designated column | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | copy of the dataframe with datetime column formatted |
Utils
required_columns(required_columns=REQUIRED_COLUMNS, easting_col='easting', northing_col='northing', data_names=None)
Decorator to ensure required columns exist in one or more DataFrame arguments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| required_columns | list | Columns that must exist | REQUIRED_COLUMNS |
| easting_col | str | Name of the easting column, which gets a custom error message | 'easting' |
| northing_col | str | Name of the northing column, which gets a custom error message | 'northing' |
| data_names | list[str] | Names of data to check | None |
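The general pattern behind a column-checking decorator of this kind can be sketched as below. This is not the library's implementation: the name `require_columns`, the column list `EXAMPLE_REQUIRED_COLUMNS`, the error type and the error message are all assumptions, and a `SimpleNamespace` stands in for a DataFrame so that the example runs without pandas.

```python
import functools
import inspect
from types import SimpleNamespace

# Hypothetical column list; the real REQUIRED_COLUMNS constant is not shown in this reference.
EXAMPLE_REQUIRED_COLUMNS = ["id", "name", "easting", "northing"]


def require_columns(required_columns=EXAMPLE_REQUIRED_COLUMNS, data_names=None):
    """Decorator factory: check that the named arguments expose all required columns."""

    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            # Default to every argument that looks like a table (has a .columns attribute).
            names = data_names or [
                name for name, value in bound.arguments.items() if hasattr(value, "columns")
            ]
            for name in names:
                columns = bound.arguments[name].columns
                missing = [col for col in required_columns if col not in columns]
                if missing:
                    raise ValueError(f"'{name}' is missing required columns: {missing}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@require_columns(data_names=["left_df"])
def count_stations(left_df):
    return len(left_df.rows)


# SimpleNamespace stands in for a DataFrame so the sketch runs without pandas.
complete = SimpleNamespace(columns=["id", "name", "easting", "northing"], rows=[("ABC001",)])
incomplete = SimpleNamespace(columns=["id", "name", "easting"], rows=[])

print(count_stations(complete))  # 1
try:
    count_stations(incomplete)
except ValueError as error:
    print(error)  # 'left_df' is missing required columns: ['northing']
```

Validating inputs in a decorator keeps the check in one place, so each decorated function can assume its dataframes are well formed.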
required_comparison_columns(required_columns=REQUIRED_COMPARISON_COLUMNS, data_names=None)
Decorator to ensure required columns exist in one or more DataFrame arguments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| required_columns | list | Columns that must exist | REQUIRED_COMPARISON_COLUMNS |
| data_names | list[str] | Names of data to check | None |
required_timeseries_columns(required_columns=REQUIRED_TIMESERIES_COLUMNS, data_names=None)
Decorator to ensure required columns exist in one or more DataFrame arguments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| required_columns | list | Columns that must exist | REQUIRED_TIMESERIES_COLUMNS |
| data_names | list[str] | Names of data to check | None |
crs_to_crs(df, crs_in, crs_out, east_west_col_in, north_south_col_in, east_west_col_out, north_south_col_out)
Convert from one CRS projection to another
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data to convert to another CRS | required |
| crs_in | str \| int | Projection of current data (e.g. 4326) | required |
| crs_out | str \| int | Target projection (e.g. 27700) | required |
| east_west_col_in | str | Name of eastward column of original projection | required |
| north_south_col_in | str | Name of northward column of original projection | required |
| east_west_col_out | str | Name of eastward column of target projection | required |
| north_south_col_out | str | Name of northward column of target projection | required |
Returns:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | Data with new target projection columns |