Introduction
What is station matching?
Imagine you have received rainfall data (or any other kind of gauge data) from two different sources. The two sources are likely to use different systems of IDs and names, and their geo-location data may also differ (e.g., in co-ordinate system or granularity).
Station matching is the process of pairing gauges from the two networks using a combination of geo-spatial and string-based metadata comparison.
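As a rough illustration of the two kinds of comparison, the sketch below computes a great-circle distance and a name-similarity ratio for a single pair of gauges. It is not the package's own implementation: the helper names, the example gauges, and the use of `difflib.SequenceMatcher` as the string measure are all assumptions made for illustration.

```python
import math
from difflib import SequenceMatcher


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (stand-in string measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical gauges from two networks with different ID and naming schemes
gauge_a = {"id": "A-001", "name": "Riverside Park", "lat": 51.5000, "lon": -0.1200}
gauge_b = {"id": "7731", "name": "RIVERSIDE PK", "lat": 51.5004, "lon": -0.1195}

distance = haversine_km(gauge_a["lat"], gauge_a["lon"], gauge_b["lat"], gauge_b["lon"])
similarity = name_similarity(gauge_a["name"], gauge_b["name"])
print(f"distance: {distance:.3f} km, name similarity: {similarity:.2f}")
```

Both values would then feed into whatever scoring scheme is used to rank candidate pairs.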
What is the process of station matching?
Every gauge in one network is compared with every gauge in the other, both for distance and for the similarity of sub-strings within the gauge names. Each candidate pair is then given a score and categorised as one of the following:
- Nothing - no matching characteristic, pair is ignored
- Accepted - high quality pairing(s), no better pairing exists
- Rank-Rejected - high quality pairing(s), better pairing exists
- Auto-Rejected - low quality pairing(s), no better pairing exists
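The four categories above can be sketched as a simple rule over scored pairs. This is only a sketch under stated assumptions: the column names, the `HIGH_QUALITY` threshold, the example scores, and the choice to rank candidates per gauge in the first network are hypothetical and not taken from the package. The text does not say what happens to a low quality pair when a better one exists, so the sketch leaves such pairs as "nothing".

```python
import pandas as pd

HIGH_QUALITY = 0.7  # hypothetical threshold separating high from low quality

# Hypothetical scored candidate pairs (a score of 0 means no matching characteristic)
pairs = pd.DataFrame({
    "id_a": ["A1", "A1", "A2", "A3"],
    "id_b": ["B1", "B2", "B3", "B4"],
    "score": [0.95, 0.80, 0.40, 0.00],
})


def categorise(pairs, high_quality=HIGH_QUALITY):
    """Label each candidate pair; ranking is done per gauge in network A (assumption)."""
    out = pairs.copy()
    best = out.groupby("id_a")["score"].transform("max")  # best score for each id_a
    is_best = out["score"] == best
    is_high = out["score"] >= high_quality
    out["category"] = "nothing"
    out.loc[is_high & is_best, "category"] = "accepted"
    out.loc[is_high & ~is_best, "category"] = "rank-rejected"
    out.loc[~is_high & is_best & (out["score"] > 0), "category"] = "auto-rejected"
    return out


categorised = categorise(pairs)
print(categorised)
```

Filtering `categorised` on the `category` column would give one dataframe per category.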
What is the outcome of the station matching process?
Depending on which pipeline a user runs, they will create either three dataframes of matches (accepted, rank-rejected, auto-rejected) or a single final output dataframe.
The final output dataframe is produced after a manual-review script or notebook, which is automatically generated for the user, has been run. The review lets the user compare rank-rejected and auto-rejected matches with those automatically accepted, and is designed to keep the required user input to a minimum.
The user notebook creates a set of one-to-one or one-to-many pairings, and the user has the option to assign main and back-up gauges for the one-to-many pairings.
The process does not allow many-to-many matches, as these would imply that there are identical stations in one or both sets of gauge metadata. Where two stations each have a main and a back-up gauge, the gauges should ideally be matched on a one-to-one basis: main to main, back-up to back-up.
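One way to check a set of pairings for many-to-many matches is to flag any row where both gauges also appear in other pairings. The sketch below assumes a dataframe with hypothetical `id_a` and `id_b` columns; it is an illustration, not the check the package itself performs.

```python
import pandas as pd

# Hypothetical final pairings: A1 has two partners, B3 has two partners
pairings = pd.DataFrame({
    "id_a": ["A1", "A1", "A2", "A3"],
    "id_b": ["B1", "B2", "B3", "B3"],
})


def many_to_many_rows(pairings):
    """Return rows whose gauge in A and gauge in B both have more than one partner."""
    n_b_per_a = pairings.groupby("id_a")["id_b"].transform("nunique")
    n_a_per_b = pairings.groupby("id_b")["id_a"].transform("nunique")
    return pairings[(n_b_per_a > 1) & (n_a_per_b > 1)]


# One-to-many in either direction passes the check
print(many_to_many_rows(pairings).empty)

# Linking A1 to B3 as well joins two groups into a many-to-many match
extra = pd.DataFrame({"id_a": ["A1"], "id_b": ["B3"]})
invalid = pd.concat([pairings, extra], ignore_index=True)
flagged = many_to_many_rows(invalid)
print(flagged)
```

Only the row that links two already-shared gauges is flagged, which points at the pairing the user would need to revisit.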
What is station comparison?
Station comparison is a deeper dive into matched gauges. It provides a basic summary analysis of the similarity between the two timeseries, including:
- Overlap - total length, number of timesteps (all and non-NaN)
- Accumulation - total, difference (absolute and percentage)
- Statistical summaries - R² (all and non-zero), Spearman's rank correlation (all and non-zero)
This can be used either to assess the quality of generated pairings or to quickly test the similarity of collocated gauges.
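The summary statistics listed above can be approximated with pandas as in the sketch below. Several choices here are assumptions rather than the package's definitions: R² is taken as the squared Pearson correlation, Spearman's rank correlation is computed as the Pearson correlation of ranks, "non-zero" means timesteps where both gauges report a non-zero value, and the function name, output keys, and example data are hypothetical.

```python
import numpy as np
import pandas as pd


def compare_series(a, b):
    """Summarise the similarity of two gauge timeseries on their shared timestamps."""
    both = pd.concat({"a": a, "b": b}, axis=1, join="inner")  # overlapping timesteps
    valid = both.dropna()  # timesteps where both gauges have a value
    nonzero = valid[(valid["a"] != 0) & (valid["b"] != 0)]

    def r_squared(df):
        return df["a"].corr(df["b"]) ** 2  # squared Pearson correlation

    def spearman(df):
        return df["a"].rank().corr(df["b"].rank())  # Pearson correlation of ranks

    total_a, total_b = valid["a"].sum(), valid["b"].sum()
    return {
        "overlap_length": both.index.max() - both.index.min(),
        "overlap_steps": len(both),
        "overlap_steps_non_nan": len(valid),
        "total_a": total_a,
        "total_b": total_b,
        "diff_abs": total_b - total_a,
        "diff_pct": 100 * (total_b - total_a) / total_a if total_a else float("nan"),
        "r2_all": r_squared(valid),
        "r2_non_zero": r_squared(nonzero),
        "spearman_all": spearman(valid),
        "spearman_non_zero": spearman(nonzero),
    }


# Hypothetical daily rainfall totals (mm) for a matched pair of gauges
index = pd.date_range("2024-01-01", periods=6, freq="D")
gauge_a = pd.Series([0.0, 1.0, 2.0, np.nan, 4.0, 0.0], index=index)
gauge_b = pd.Series([0.0, 2.0, 4.0, 6.0, 8.0, 0.0], index=index)

summary = compare_series(gauge_a, gauge_b)
for key, value in summary.items():
    print(f"{key}: {value}")
```

In this example the second gauge always reports twice the first, so the correlations are perfect while the accumulation differs by 100%, which shows why both kinds of statistic are worth reporting.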