Introduction

What is station matching?

Imagine you have received rainfall data (or any other kind of gauge data) from two different sources. The sources are likely to use different systems of IDs and names, and even different geo-location data (e.g., co-ordinate system, granularity).

Station matching is the process of pairing gauges from both networks using a combination of geospatial and string-based metadata comparison.

What is the process of station matching?

Every gauge in one network is compared with every gauge in the other, both for distance and for the similarity of sub-strings within the gauge names. Each pair is then given a score and categorised as one of the following:

  • Nothing - no matching characteristic, pair is ignored
  • Accepted - high quality pairing(s), no better pairing exists
  • Rank-Rejected - high quality pairing(s), better pairing exists
  • Auto-Rejected - low quality pairing(s), no better pairing exists
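
The comparison and categorisation described above can be sketched roughly as follows. This is a minimal illustration, not the package's own implementation: the haversine distance, the `difflib` similarity ratio, the equal weighting, the thresholds (`max_km`, `min_sim`, `accept_score`) and the choice to rank candidates per gauge in the first network are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def name_similarity(a, b):
    """Similarity ratio in [0, 1]; a stand-in for the sub-string comparison."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pairs(network_a, network_b, max_km=5.0, min_sim=0.5, accept_score=0.7):
    """Compare every gauge in network_a with every gauge in network_b.

    All thresholds are hypothetical values chosen for illustration.
    """
    pairs = []
    for id_a, (name_a, lat_a, lon_a) in network_a.items():
        for id_b, (name_b, lat_b, lon_b) in network_b.items():
            dist = haversine_km(lat_a, lon_a, lat_b, lon_b)
            sim = name_similarity(name_a, name_b)
            if dist > max_km and sim < min_sim:
                # Neither close nor similarly named: no matching characteristic.
                score, category = 0.0, "nothing"
            else:
                closeness = max(0.0, 1.0 - dist / max_km)
                score = 0.5 * closeness + 0.5 * sim
                category = "candidate" if score >= accept_score else "auto-rejected"
            pairs.append({"id_a": id_a, "id_b": id_b, "score": score, "category": category})

    # Among high-quality candidates for the same network-A gauge, the best
    # score is accepted and the others are rank-rejected.
    best = {}
    for p in pairs:
        if p["category"] == "candidate":
            best[p["id_a"]] = max(best.get(p["id_a"], 0.0), p["score"])
    for p in pairs:
        if p["category"] == "candidate":
            p["category"] = "accepted" if p["score"] == best[p["id_a"]] else "rank-rejected"
    return pairs


# Made-up gauges: id -> (name, latitude, longitude)
network_a = {"A1": ("Oakfield Road", 51.5000, -0.1000), "A2": ("Mill Lane", 51.6000, -0.2000)}
network_b = {
    "B1": ("Oakfield Rd", 51.5001, -0.1001),
    "B2": ("Oakfield Road Backup", 51.5010, -0.1000),
    "B3": ("Quay West", 51.5300, -0.1000),
}

categories = {(p["id_a"], p["id_b"]): p["category"] for p in score_pairs(network_a, network_b)}
for pair, category in sorted(categories.items()):
    print(pair, category)
```

In this toy data, the nearby, similarly named gauge is accepted, the slightly weaker candidate for the same gauge is rank-rejected, and the nearby but differently named gauge is auto-rejected.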

What is the outcome of the station matching process?

Depending on which pipeline a user runs, they will create either three dataframes of matches (accepted, rank-rejected, auto-rejected) or a single final output dataframe.

The final output dataframe is produced after a manual-review script or notebook, which is generated automatically for the user. It allows them to compare rank-rejected and auto-rejected matches with those automatically accepted, and is designed to require minimal user input.

The user notebook creates a set of one-to-one or one-to-many pairings. For the one-to-many pairings, the user has the option to assign main and back-up gauges.

The process does not allow many-to-many matches, as these would imply there are identical stations in one or both sets of gauge metadata. Where both stations in a pair have a main and a back-up gauge, the gauges should ideally be matched on a one-to-one basis: main to main, back-up to back-up.
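
As an illustration of the many-to-many rule, the sketch below flags any pair whose two gauges each appear in more than one pairing, which is the situation the process avoids. The function name `many_to_many_pairs` and the example IDs are hypothetical, and this is only one possible way to detect the condition.

```python
from collections import Counter


def many_to_many_pairs(pairs):
    """Return pairs where both gauges also appear in at least one other pair."""
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    return [(a, b) for a, b in pairs if count_a[a] > 1 and count_b[b] > 1]


# One-to-many is fine: A1 has a main (B1) and a back-up (B2) candidate.
one_to_many = [("A1", "B1"), ("A1", "B2")]

# Adding (A2, B2) links two gauges on each side, so (A1, B2) is flagged.
mixed = [("A1", "B1"), ("A1", "B2"), ("A2", "B2")]

print(many_to_many_pairs(one_to_many))
print(many_to_many_pairs(mixed))
```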

What is station comparison?

Station comparison is a deeper dive into matched gauges. It provides a basic summary analysis of the similarity between the two timeseries, including:

  • Overlap - total length, number of timesteps (all and non-NaN)
  • Accumulation - total, difference (absolute and percentage)
  • Statistical summaries - R² (all and non-zero), Spearman's rank correlation (all and non-zero)

This can be used either to assess the quality of generated pairings or to quickly test the similarity of collocated gauges.