Introduction¶

What is station matching?¶

Imagine you have received rainfall data, or any other kind of gauge data, from two different sources. The two sources are likely to use different systems of IDs and names, and even different geo-location data (e.g., co-ordinate system, granularity).

Station matching is the process of pairing gauges across the two networks using a combination of geo-spatial and string-based metadata comparison.

What is the process of station matching?¶

Each gauge in one network is compared pairwise with each gauge in the other, both for distance and for the similarity of sub-strings within the gauge names. Each match is then given a score and categorised as one of the following:

Nothing - no matching characteristic, pair is ignored
Accepted - high-quality pairing(s), no better pairing exists
Rank-Rejected - high-quality pairing(s), better pairing exists
Auto-Rejected - low-quality pairing(s), no better pairing exists
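
The comparison and categorisation described above can be sketched roughly as follows. This is a minimal illustration, not the package's own implementation: the function names, the thresholds, the equal-weight scoring formula and the use of `difflib` for name similarity are all assumptions made for the example. In particular, the sketch treats every low-scoring pair as auto-rejected, whether or not a better pairing exists.

```python
import math
from difflib import SequenceMatcher

# Hypothetical thresholds, chosen only for illustration.
MAX_DISTANCE_KM = 5.0
MIN_NAME_SIMILARITY = 0.6
ACCEPT_SCORE = 0.7


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive ratio in [0, 1]; a stand-in for sub-string comparison."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pair(distance_km, similarity):
    """Hypothetical score in [0, 1]: equal weight to proximity and name similarity."""
    proximity = max(0.0, 1.0 - distance_km / MAX_DISTANCE_KM)
    return 0.5 * proximity + 0.5 * similarity


def match_networks(net_a, net_b):
    """Compare every gauge in net_a with every gauge in net_b and categorise each pair."""
    pairs = []
    for a in net_a:
        for b in net_b:
            dist = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
            sim = name_similarity(a["name"], b["name"])
            pairs.append({"id_a": a["id"], "id_b": b["id"], "distance_km": dist,
                          "name_similarity": sim, "score": score_pair(dist, sim)})

    # Best score seen for each gauge, used to decide whether a better pairing exists.
    best_a, best_b = {}, {}
    for p in pairs:
        best_a[p["id_a"]] = max(best_a.get(p["id_a"], 0.0), p["score"])
        best_b[p["id_b"]] = max(best_b.get(p["id_b"], 0.0), p["score"])

    for p in pairs:
        if p["distance_km"] > MAX_DISTANCE_KM and p["name_similarity"] < MIN_NAME_SIMILARITY:
            p["category"] = "nothing"  # neither close nor similarly named
        elif p["score"] < ACCEPT_SCORE:
            p["category"] = "auto-rejected"
        elif p["score"] == best_a[p["id_a"]] and p["score"] == best_b[p["id_b"]]:
            p["category"] = "accepted"
        else:
            p["category"] = "rank-rejected"
    return pairs


# Made-up gauges from two hypothetical networks.
net_a = [{"id": "A1", "name": "Oxford Road", "lat": 51.750, "lon": -1.250}]
net_b = [
    {"id": "B1", "name": "OXFORD ROAD", "lat": 51.7501, "lon": -1.2501},
    {"id": "B2", "name": "Oxford Road North", "lat": 51.760, "lon": -1.250},
    {"id": "B3", "name": "Cambridge Park", "lat": 52.200, "lon": 0.120},
]
for pair in match_networks(net_a, net_b):
    print(pair["id_a"], pair["id_b"], round(pair["score"], 3), pair["category"])
```

Comparing every gauge with every other scales with the product of the two network sizes, which is usually acceptable for gauge networks of a few hundred or thousand stations.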

What is the outcome of the station matching process?¶

Depending on which pipeline a user runs, they will create either three dataframes of matches (accepted, rank-rejected, auto-rejected) or a single final output dataframe.

The final output dataframe is produced by a manual-review script or notebook, which is automatically generated for the user. It allows them to compare rank-rejected and auto-rejected matches with those automatically accepted, while keeping the required user input to a minimum.

The user notebook creates a set of one-to-one or one-to-many pairings. For the one-to-many pairings, the user has the option to assign main and back-up gauges.

The process does not allow many-to-many matches, as these would imply that there are identical stations in one or both sets of gauge metadata. Where both stations in a pair have a primary and a back-up gauge, the gauges should ideally be matched on a one-to-one basis: main to main, back-up to back-up.
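
As a rough illustration of the one-to-one, one-to-many and many-to-many distinction, the sketch below labels each pairing by counting how many partners each gauge has. The function name `classify_pairings` and the gauge IDs are hypothetical, and this is only one way such a check could be written.

```python
from collections import Counter


def classify_pairings(pairs):
    """Label each (id_a, id_b) pairing by the number of partners on each side."""
    unique_pairs = set(pairs)  # ignore exact duplicates
    count_a = Counter(a for a, _ in unique_pairs)
    count_b = Counter(b for _, b in unique_pairs)
    labels = {}
    for a, b in unique_pairs:
        if count_a[a] > 1 and count_b[b] > 1:
            labels[(a, b)] = "many-to-many"  # not allowed by the process
        elif count_a[a] > 1 or count_b[b] > 1:
            labels[(a, b)] = "one-to-many"
        else:
            labels[(a, b)] = "one-to-one"
    return labels


# Made-up pairings: A1 pairs with two gauges, and B2 pairs with two gauges.
pairings = [("A1", "B1"), ("A1", "B2"), ("A2", "B2"), ("A3", "B3")]
for pair, label in sorted(classify_pairings(pairings).items()):
    print(pair, label)
```

Here the pairing between A1 and B2 is flagged because both gauges already have another partner, so resolving it would require the user to drop one of the competing pairings.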

What is station comparison?¶

Station comparison is a deeper dive into matched gauges. It provides a basic summary analysis of the similarity between the two timeseries, including:

Overlap - total length, number of timesteps (all and non-NaN)
Accumulation - total, difference (absolute and percentage)
Statistical summaries - R² (all and non-zero), Spearman's rank correlation (all and non-zero)

This can be used either to assess the quality of generated pairings or to quickly test the similarity of collocated gauges.
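
The summary statistics listed above could be computed along the following lines. This is a sketch under stated assumptions rather than the package's own code: the function names are hypothetical, the two series are assumed to be already aligned on the same timesteps with NaN marking missing values, R² is taken to be the squared Pearson correlation, and "non-zero" is taken to mean timesteps where both gauges report a non-zero value.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def correlation_stats(pairs):
    """Return (R², Spearman's rank correlation) for a list of (x, y) pairs."""
    if len(pairs) < 2:
        return math.nan, math.nan
    xs, ys = zip(*pairs)
    return pearson(xs, ys) ** 2, pearson(ranks(xs), ranks(ys))


def compare_series(a, b):
    """Summarise overlap, accumulation and correlation for two aligned series."""
    valid = [(x, y) for x, y in zip(a, b) if not (math.isnan(x) or math.isnan(y))]
    nonzero = [(x, y) for x, y in valid if x != 0 and y != 0]
    total_a = sum(x for x, _ in valid)
    total_b = sum(y for _, y in valid)
    diff = total_b - total_a
    r2_all, rho_all = correlation_stats(valid)
    r2_nz, rho_nz = correlation_stats(nonzero)
    return {
        "timesteps_all": min(len(a), len(b)),
        "timesteps_non_nan": len(valid),
        "total_a": total_a,
        "total_b": total_b,
        "difference": diff,
        "difference_pct": 100 * diff / total_a if total_a else math.nan,
        "r2_all": r2_all,
        "spearman_all": rho_all,
        "r2_nonzero": r2_nz,
        "spearman_nonzero": rho_nz,
    }


# Made-up rainfall values; the second gauge reads exactly double the first.
gauge_a = [0.0, 1.0, 2.0, math.nan, 4.0, 0.0]
gauge_b = [0.0, 2.0, 4.0, 1.0, 8.0, 0.0]
for key, value in compare_series(gauge_a, gauge_b).items():
    print(key, value)
```

Reporting the correlations both with and without zero values is useful for rainfall, where long dry periods can otherwise dominate the statistics and make two gauges look more alike than they are.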