Station Matching¶

1. Why might you need to do station matching?¶

Let's imagine you have two lists of rain gauge metadata that are different but are for same underlying rain gauge sites (e.g. Tables 1 & 2). They might be different because the station lists were:

collected at different times, i.e. older and newer
collected by different instituions who use their own unique IDs
collected with different geographic projections (and now have rounding errors)

RainGaugeMatcha will help you match the station lists.

Table 1. List of rain gauge stations from source 1

Table 2. List of rain gauge stations from source 2

2. Running the station matching algorithm¶

To run the station matching algorithm, your two rain gauge lists will need to have at least these 4 columns: "station_id", "station_name", "easting", "northing". Then you can load in and run the algorithm like this:

Python

import pandas as pd
from raingaugematcha import run_matching_algorithm

station_list_1 = pd.read_csv("path/to/station_list_1.csv")
station_list_2 = pd.read_csv("path/to/station_list_2.csv")

matches, accepted_df, rank_rejected_df, auto_rejected_df = (
    run_matching_algorithm(
        station_list_1,
        station_list_2
    )
)

Saving outputs

If you would like to save the outputs of the algorithm to csv, add save_outputs_to_csv=True to run_matching_algorithm(...) and select a output location with output_dir="outputs/"

2.1 Intepreting the matches¶

A reminder of the types of matches provided by RainGaugeMatcha:

Nothing - no matching characteristic, pair is ignored
Accepted - high quality pairing(s), no better pairing exists
Rank-Rejected - high quality pairing(s), better pairing exists
Auto-Rejected - low quality pairings(s), no better pairing exists

Here is what they make look like:

Accepted matches:¶

Table 3. Accepted matches

Rank-rejected matches:¶

Table 4. Rank-rejected matches

Auto-rejected matches:¶

Table 5. Auto-rejected matches

It is now up to the user to interpret and manually check the automatically produced matches (discussed in #3)

Running manual matching

If you would like to save the outputs of the algorithm to csv, add save_outputs_to_csv=True to run_matching_algorithm(...) and select a output location with output_dir="outputs/"

3. Manual station matching¶

RainGaugeMatcha generates files needed for the manual station matching procedure. In this step of the station matching, you will be able to override matches from the automatic station matching (see 2.). You can generate a python script and/or a notebook, let's generate both below:

Python

import pandas as pd
from raingaugematcha import run_matching_algorithm

station_list_1 = pd.read_csv("path/to/station_list_1.csv")
station_list_2 = pd.read_csv("path/to/station_list_2.csv")

matches, accepted_df, rank_rejected_df, auto_rejected_df = (
    run_matching_algorithm(
        station_list_1,
        station_list_2,
        output_dir='matching_outputs',
        save_outputs_to_csv=True,
        save_manual_matching_script=True,
        save_manual_matching_notebook=True,
    )
)

Overwriting output files

By default run_matching_algorithm(...) will not overwrite any files if you are saving outputs, you can use overwrite_existing=True to overwrite files in output directory

The above code will create two files under matching_outputs/:

manual_station_matching_wo_backups.py
manual_station_matching.ipynb

Each of which provide user input prompts running through a 3-step procedure:

Check rank-rejected matches
Check auto-rejected matches
Check for duplicates in matches

Running station matching without generating files

If you would like to run the station matching and run the manual matching without generating or saving files, please use the procedure from this notebook: src/raingaugematcha/core/manual_station_matching.ipynb.

3.1 Checking rank-rejected matches¶

To begin the manual checking of the station matches, you can use the generated CSVs for accepted and auto rejected matches:

Python

  import pandas as pd
  from raingaugematcha import manually_process_matches

  # generated by run_station_matching(...)
  accepted_matches = pd.read_csv('accepted.csv')
  rank_rejected_matches = pd.read_csv('rank_rejected.csv')

  # Check rank rejected matches
  accepted_matches_to_replace, rejected_matches_to_accept, accepted_matches_to_remove = (
    manually_process_matches.manually_process_rank_rejected_matches(
      accepted_matches,
      rank_rejected_matches
    )
  )

After you run this, you will get a prompt such as shown below:

rank-rejected manual matching — Prompt for manually checking rank-rejected matches

Making mistakes when running through the prompts

If you make a mistake whilst responding the prompts, simply interupt the process and run that line again

3.1.1 Updating accepted matches¶

After you run the user prompts, you can replace and overwrite the matches using these three lines:

Python

  # 1. Remove accepted matches manually rejected
  accepted_matches = (
    manually_process_matches.remove_matches_by_indices(
      accepted_matches,
      idx_to_remove=accepted_matches_to_remove
      )
    )

  # 2. Replace accepted matches with previously rank-rejected
  accepted_matches = (
    manually_process_matches.replace_match_with_new_match(
      accepted_matches,
      old_matches_to_replace=accepted_matches_to_replace
      )
    )

  # 2. Add new matches previously rank-rejected
  accepted_matches = (
    manually_process_matches.add_new_rows_to_match_df(
      accepted_matches,
      new_matches_to_add=rejected_matches_to_accept
      )
    )

3.2 Checking auto-rejected matches¶

To begin the manual checking of the station matches, you can use the generated CSVs for accepted and auto rejected matches:

Python

  import pandas as pd
  from raingaugematcha import manually_process_matches

  # generated by run_station_matching(...)
  accepted_matches = pd.read_csv('accepted.csv')
  auto_rejected_matches = pd.read_csv('auto_rejected.csv')

  # Check auto rejected matches
  accepted_matches_to_replace, rejected_matches_to_accept, accepted_matches_to_remove = (
    manually_process_matches.manually_process_auto_rejected_matches(
      accepted_matches,
      auto_rejected_matches
    )
  )

The prompts will look like this:

auto-rejected manual matching — Prompt for manually checking auto-rejected matches

Making mistakes when running through the prompts

If you make a mistake whilst responding the prompts, simply interupt the process and run that line again

3.2.1 Updating accepted matches¶

Similar to 3.1.1., you will need to update the accepted matches after running the manual checks of auto-rejected matches:

Python

  # 1. Remove accepted matches manually rejected
  accepted_matches = (
    manually_process_matches.remove_matches_by_indices(
      accepted_matches,
      idx_to_remove=accepted_matches_to_remove
      )
    )

  # 2. Add new matches previously rank-rejected
  accepted_matches = (
    manually_process_matches.add_new_rows_to_match_df(
      accepted_matches,
      new_matches_to_add=rejected_matches_to_accept
      )
    )

3.3 Checking for duplicates in matches¶

Finally, it is important you check for duplicates in the matches last. The previous two steps may generate these duplicates that will then need to be cleaned up. Let's assume that we do not want any backups so we are only looking for one station from station_list_1 to match with one station from station_list_2:

Python

  import pandas as pd
  from raingaugematcha import manually_process_matches

  # final_accepted_matches ==  the output of steps 3.1 & 3.2 above
  many_to_one_left_stations = manually_process_matches.get_matches_with_duplicates(final_accepted_matches, match_col="station_left.name")
  many_to_one_right_stations = manually_process_matches.get_matches_with_duplicates(final_accepted_matches, match_col="station_right.name")

  # Check duplicates one left to many right
  accepted_matches_to_remove = (
    manually_process_matches.manually_process_duplicate_matches(
      final_accepted_matches,
      duplicate_matches=many_to_one_left_stations,
      duplicate_matching_col="station_right.name",
      create_backups=False
    )
  )

  # Check duplicates one left to many right
  accepted_matches_to_remove = (
    manually_process_matches.manually_process_duplicate_matches(
      final_accepted_matches,
      duplicate_matches=many_to_one_right_stations,
      duplicate_matching_col="station_right.name",
      create_backups=False
    )
  )

The prompt will then look like this:

duplicate manual matching — Prompt for manually checking duplicate matches

And you can remove those duplicate matches with:

Python

  final_accepted_matches = (
    manually_process_matches.remove_matches_by_indices(
      final_accepted_matches,
      idx_to_remove=accepted_matches_to_remove
      )
  )

3.4 Allowing backup matches¶

Sometimes rain gauges have backups. Let's assume that we want to allow for backups and that we are looking for one station from station_list_1 (left) to match with many station from station_list_2 (right). Setting allow_backups=True will allow you to accept backups to a given station. Scripts for allowing backups can be generated like:

Python

from raingaugematcha import run_matching_algorithm

matches, accepted_df, rank_rejected_df, auto_rejected_df = (
    run_matching_algorithm(
        station_list_1,
        station_list_2,
        output_dir='matching_outputs',
        save_outputs_to_csv=True,
        save_manual_matching_script=True,
        save_manual_matching_notebook=True,
        allow_backups=True,
    )
)

After running through and checking the rank- and auto-rejected matches, you will get to checking the duplicates (step 3.3), this will run a line similar to:

Python

```python

from raingaugematcha import manually_process_matches

# Allow for one left to many right
accepted_matches_to_remove, backup_station_indices = (
  manually_process_matches.manually_process_duplicate_matches(
    final_accepted_matches,
    duplicate_matches=many_to_one_left_stations,
    duplicate_matching_col="station_left.name",
    create_backups=True
    )
  )

# Disallow for one right to many left
accepted_matches_to_remove, backup_station_indices = (
  manually_process_matches.manually_process_duplicate_matches(
    final_accepted_matches,
    duplicate_matches=many_to_one_right_stations,
    duplicate_matching_col="station_right.name",
    create_backups=False
    )
  )

```

The prompt for creating a backup will look like this:

Get in touch with the developers @ UKCEH via email if you have any questions...