Update 10.03.2025: I updated the analysis in this blog post so that it runs on more recent data. More precisely, I use the train traffic data from March 5, 2025 to construct the network. Moreover, I now properly reference the source data and have added a number of additional node attributes. The most interesting new node attributes are the average passenger frequencies for all stations.
For quite some time I have been wondering if there are some interesting Swiss data that would serve as the basis for some fun network analysis. As a fan of public transportation and a long-time owner of a Swiss train pass (“GA”), the answer should have been obvious much sooner: the Swiss railway network.
I wanted to create a (static) network in which each node corresponds to a train station and each directed edge between any two nodes, A and B, means there is at least one train going nonstop from A to B. Ideally, the edge would also be attributed with some weight representing the importance of the edge (e.g., how many trains go nonstop from A to B on a given day).
The structure of this post is as follows. I will first introduce the three datasets that I used to create the network. I will then show how to load and preprocess each one of them and how to join them. Finally, I will present how to transform those data into a form that is suitable for network analysis. The following image shows a visualization of the network data resulting from this post.
The Swiss railway network with a geographic layout (created using Gephi).
This is the first part of a series that will cover all kinds of fun network analysis based on the Swiss railway network.
Data sources
It was not obvious how a network with nodes and edges following the definitions given above could be constructed from data of the Swiss Federal Railways (abbreviated by the German speakers in Switzerland as SBB). With some help from SBB experts and the Open Data Platform Mobility Switzerland, I finally found the right data.
The first and most important dataset is called Ist-Daten and, for a given day, contains all regular stops of all trains in Switzerland with their planned and effective arrival and departure times. From this data, we can infer all nonstop stretches of any train in Switzerland. A description of this dataset can be found here.
Note that the “Ist-Daten” not only contain the data for trains but also for all other public transport (buses, trams, and even boats). To keep things simple we will focus on the train network.
The second dataset is the Dienststellen-Daten, which basically allows us to add node attributes such as the geographic coordinates of a node (i.e., a train station). A description of this dataset can be found here. The third dataset contains average passenger frequencies for the stations, which we will use as additional node attributes; it is introduced in more detail below.
Load and preprocess “Ist-Daten”
Here, we will load and preprocess the “Ist-Daten” from which we can derive the edges of our network. First, I import some Python libraries and print their version numbers for better reproducibility of this code.
import pandas as pd
import numpy as np
from collections import Counter

# Check versions of libraries.
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)

# Make sure there is no limit on the number of columns shown.
pd.set_option('display.max_columns', None)
NumPy version: 1.26.4
Pandas version: 2.1.4
Let’s now load the data. You can see in the filename that I downloaded the “Ist-Daten” from the SBB portal for March 5, 2025. You can get the data for any day you want here.
# Load the data
df = pd.read_csv('2025-03-05_istdaten.csv', sep=";", low_memory=False)
To get a feeling for the data, let’s check the number of rows and columns.
# Number of rows and columns
print(df.shape)
(2510290, 21)
Ok, it’s actually a pretty big dataset: it has over 2.5 million rows. That makes sense as this file contains every stop of every vehicle involved in public transport on a given day. Thus, every row corresponds to a stop of a train, bus, or any other vehicle of public transport.
Checking the missing values per column (e.g., with df.isna().sum()) shows that some columns contain many of them. The only one I worry about for now is the column PRODUKT_ID. If you look through the affected rows (I don't show that here), you can see that they should all be of type “Zug” (train). Thus, we impute accordingly:
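# Fill the missing PRODUKT_ID values with "Zug" (they are all trains).
df['PRODUKT_ID'] = df['PRODUKT_ID'].fillna('Zug')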
Now is a good time to finally have a look at the dataframe:
# Let's look at first few rows
df.head()
| | BETRIEBSTAG | FAHRT_BEZEICHNER | BETREIBER_ID | BETREIBER_ABK | BETREIBER_NAME | PRODUKT_ID | LINIEN_ID | LINIEN_TEXT | UMLAUF_ID | VERKEHRSMITTEL_TEXT | ZUSATZFAHRT_TF | FAELLT_AUS_TF | BPUIC | HALTESTELLEN_NAME | ANKUNFTSZEIT | AN_PROGNOSE | AN_PROGNOSE_STATUS | ABFAHRTSZEIT | AB_PROGNOSE | AB_PROGNOSE_STATUS | DURCHFAHRT_TF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-03-05 | 80:800631:17230:000 | 80:800631 | DB | DB Regio AG Baden-Württemberg | Zug | 17230 | RB | NaN | RB | False | False | 8500090 | Basel Bad Bf | 2025-03-05 04:59:00 | 2025-03-05 04:59:00 | PROGNOSE | NaT | NaT | NaN | False |
| 1 | 2025-03-05 | 80:800631:17233:000 | 80:800631 | DB | DB Regio AG Baden-Württemberg | Zug | 17233 | RB | NaN | RB | False | False | 8500090 | Basel Bad Bf | NaT | NaT | NaN | 2025-03-05 06:07:00 | 2025-03-05 06:08:00 | PROGNOSE | False |
| 2 | 2025-03-05 | 80:800631:17234:000 | 80:800631 | DB | DB Regio AG Baden-Württemberg | Zug | 17234 | RB | NaN | RB | False | False | 8500090 | Basel Bad Bf | 2025-03-05 05:56:00 | 2025-03-05 06:02:00 | PROGNOSE | NaT | NaT | NaN | False |
| 3 | 2025-03-05 | 80:800631:17235:000 | 80:800631 | DB | DB Regio AG Baden-Württemberg | Zug | 17235 | RB | NaN | RB | False | False | 8500090 | Basel Bad Bf | NaT | NaT | NaN | 2025-03-05 06:43:00 | 2025-03-05 06:53:00 | PROGNOSE | False |
| 4 | 2025-03-05 | 80:800631:17236:000 | 80:800631 | DB | DB Regio AG Baden-Württemberg | Zug | 17236 | RB | NaN | RB | False | False | 8500090 | Basel Bad Bf | 2025-03-05 06:31:00 | 2025-03-05 06:34:00 | PROGNOSE | NaT | NaT | NaN | False |
But what do all these columns mean? I browsed the metadata a bit and found the following explanations (which I hopefully reproduce accurately in English):
BETRIEBSTAG: Simply the day on which the data were recorded.
FAHRT_BEZEICHNER: An elaborate identifier in the format [UIC-Countrycode]:[GO-Number]:[VM-Number]:[Extended Reference] (see the short parsing example right after this list).
BETREIBER_ID: [UIC-Countrycode]:[GO-Number]. GO is short for “Geschäftsorganisation” (business organization). For foreign organizations it is not a GO-Number but a TU-Number, with TU meaning “Transportunternehmen” (transport company). It is basically an ID for the company running that particular train.
BETREIBER_ABK: The abbreviation for the company running the train.
BETREIBER_NAME: The full name of the company running the train.
PRODUKT_ID: Type of public transport.
LINIEN_ID: The ID for the route of that train.
LINIEN_TEXT: The public ID for the route of that train.
UMLAUF_ID: An ID for an “Umlauf”, which describes the period starting with the vehicle leaving the depot and ending with the vehicle being deposited back in the depot.
VERKEHRSMITTEL_TEXT: Not covered by my metadata notes, but visible in the data: presumably the category of the means of transport (e.g., “RB” for Regionalbahn in the rows shown above).
ZUSATZFAHRT_TF: Is true if it is an extraordinary (not usually scheduled) trip.
FAELLT_AUS_TF: Is true if the trip is cancelled.
BPUIC: The ID of the station.
HALTESTELLEN_NAME: The name of the station.
ANKUNFTSZEIT: Planned time of arrival at the station.
AN_PROGNOSE: Prediction of time of arrival at the station.
AN_PROGNOSE_STATUS: Status of that prediction. Possible values are: “UNBEKANNT”, “leer”, “PROGNOSE”, “GESCHAETZT”, “REAL”. If the value of that column is “REAL”, it means that the predicted time of arrival is the time the train actually arrived at the station.
ABFAHRTSZEIT, AB_PROGNOSE, AB_PROGNOSE_STATUS: Same definitions as for arrival but here for departure from the station.
DURCHFAHRT_TF: Is true if the vehicle does not stop even if a stop was scheduled.
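To make the FAHRT_BEZEICHNER format concrete, here is a small sketch (using an identifier value that appears later in this post) that splits it into its four components:

# Split a FAHRT_BEZEICHNER into its components:
# [UIC-Countrycode]:[GO-Number]:[VM-Number]:[Extended Reference]
country_code, go_number, vm_number, extended_ref = "85:22:1083:000".split(":")
print(country_code, go_number, vm_number, extended_ref)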
Let’s now have a look at the values in the column PRODUKT_ID:
# Look at PRODUKT_ID
df["PRODUKT_ID"].value_counts()
PRODUKT_ID
Bus 1965207
Tram 249408
Zug 163649
BUS 124171
Metro 4936
Zahnradbahn 1944
Schiff 975
Name: count, dtype: int64
We can see that trains are only the third most frequent category in this data. However, as mentioned before, we want to keep it simple and now reduce the dataset to only trains.
# First we reduce to only trains
df = df[df['PRODUKT_ID'] == "Zug"]
In the next step, we remove all rows where the corresponding train was cancelled.
# Filter out all entries with FAELLT_AUS_TF == True
df = df[df['FAELLT_AUS_TF'] == False]
Let’s explore the data a bit more before we move to the second dataset. Let’s check out the most frequent values that occur in the column HALTESTELLEN_NAME:

# Look at HALTESTELLEN_NAME
df["HALTESTELLEN_NAME"].value_counts().head()

Unsurprisingly, Zürich HB and Bern are the most frequent values occurring in the data.
Load and preprocess “Dienststellen-Daten”
Fortunately, we can go through the second dataset a bit more quickly. We again start by loading it and checking the dimensions of the dataframe.
# Load the data
ds = pd.read_csv('actual_date-swiss-only-service_point-2025-03-06.csv', sep=";", low_memory=False)

# Number of rows and columns
print(ds.shape)
(55308, 55)
The data contains a column validTo that allows us to filter out all stations that are not valid anymore (closed down?). We check the values that appear in this column and see that all stations should be valid as of March 6, 2025. This is no surprise as we use the dataset of currently valid stations.
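A quick way to verify this (a minimal sketch, assuming the validTo column parses as a date):

# The earliest 'validTo' date lies in the future, so all stations are currently valid.
pd.to_datetime(ds['validTo']).min()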
Let’s also quickly make sure that we have unique rows (based on ‘number’).
# Is the number of unique 'number' (= BPUIC) values equal to the number of rows?
len(pd.unique(ds['number'])) == ds.shape[0]
True
Finally, we keep only the columns we need (identifier, official name, canton, municipality, operating company, geographic coordinates, and elevation).
# Keep only the relevant columns
ds = ds[["number", "designationOfficial", "cantonName", "municipalityName",
         "businessOrganisationDescriptionEn", "wgs84East", "wgs84North", "height"]]

# Show first few rows
ds.head()
| | number | designationOfficial | cantonName | municipalityName | businessOrganisationDescriptionEn | wgs84East | wgs84North | height |
|---|---|---|---|---|---|---|---|---|
| 0 | 1322001 | Antronapiana | NaN | NaN | Autoservizi Comazzi S.R.L. | 8.113620 | 46.060120 | 0.0 |
| 1 | 1322002 | Anzola d'Ossola | NaN | NaN | Autoservizi Comazzi S.R.L. | 8.345715 | 45.989869 | 0.0 |
| 2 | 1322003 | Baceno | NaN | NaN | Autoservizi Comazzi S.R.L. | 8.319256 | 46.261501 | 0.0 |
| 3 | 1322012 | Castiglione | NaN | NaN | Autoservizi Comazzi S.R.L. | 8.214886 | 46.020588 | 0.0 |
| 4 | 1322013 | Ceppo Morelli | NaN | NaN | Autoservizi Comazzi S.R.L. | 8.069922 | 45.971036 | 0.0 |
Load and preprocess average traffic data
This part is also fairly easy. We load the data and check the dimensions, as always.
# Load the data
ds_freq = pd.read_csv('t01x-sbb-cff-ffs-frequentia-2023.csv', sep=";", low_memory=False)

# Number of rows and columns
print(ds_freq.shape)
(3479, 14)
If you actually have a look at the data, you see that many stations have several measurements taken at different times (the year of each measurement is stored in the column Jahr_Annee_Anno). We only want to keep the most recent measurement for every station:
# For every station, we only keep the most recent measurements.
ds_freq = ds_freq.loc[ds_freq.groupby('UIC')['Jahr_Annee_Anno'].idxmax()]
Checking the data types of all columns reveals that there is still a problem with the measurement columns DTV_TJM_TGM, DWV_TMJO_TFM, and DNWV_TMJNO_TMGNL. They are currently of type object because they contain the thousands separator ’. We thus remove all instances of this character and convert these columns to integers.
# Data types of columns
ds_freq.dtypes

# Remove thousand separator and make integers out of it.
ds_freq['DTV_TJM_TGM'] = ds_freq['DTV_TJM_TGM'].str.replace('’', '').astype(int)
ds_freq['DWV_TMJO_TFM'] = ds_freq['DWV_TMJO_TFM'].str.replace('’', '').astype(int)
ds_freq['DNWV_TMJNO_TMGNL'] = ds_freq['DNWV_TMJNO_TMGNL'].str.replace('’', '').astype(int)
Finally, we keep only the relevant columns.
# Keep only the relevant columns
ds_freq = ds_freq[["UIC", "DTV_TJM_TGM", "DWV_TMJO_TFM", "DNWV_TMJNO_TMGNL"]]

# Show first few rows
ds_freq.head()
| | UIC | DTV_TJM_TGM | DWV_TMJO_TFM | DNWV_TMJNO_TMGNL |
|---|---|---|---|---|
| 411 | 8500010 | 98600 | 105900 | 81900 |
| 423 | 8500016 | 90 | 100 | 60 |
| 2024 | 8500020 | 5700 | 7000 | 2800 |
| 2294 | 8500021 | 8500 | 9900 | 5200 |
| 1072 | 8500022 | 3600 | 4100 | 2300 |
But what exactly are these three measurement variables? The source dataset provides the following definitions:
DTV_TJM_TGM: “Average daily traffic (Monday to Sunday).”
DWV_TMJO_TFM: “Average traffic on weekdays (Monday to Friday).”
DNWV_TMJNO_TMGNL: “Average non-work day traffic (Saturdays, Sundays and public holidays).”
It is further mentioned that all passengers boarding and exiting the trains are counted. That also means that passengers who switch trains are counted twice. For larger stations, the data may not cover all trains arriving and departing at the corresponding station. For example, the numbers for Bern do not include the traffic generated by the regional train company RBS.
Combine the three datasets
We first merge the traffic data to the “Dienststellen-Daten”:
# Join to 'ds'
ds = pd.merge(ds, ds_freq, left_on='number', right_on='UIC', how='left')

# Drop 'UIC'
ds = ds.drop('UIC', axis=1)

# Better column names
ds.columns = ['BPUIC', 'STATION_NAME', 'CANTON', 'MUNICIPALITY', 'COMPANY',
              'LONGITUDE', 'LATITUDE', 'ELEVATION', 'AVG_DAILY_TRAFFIC',
              'AVG_DAILY_TRAFFIC_WEEKDAYS', 'AVG_DAILY_TRAFFIC_WEEKENDS']
Then we merge the “Dienststellen-Daten” to the “Ist-Daten” via the BPUIC variable:
# Left-join with station names and coordinates
df = pd.merge(df, ds, on='BPUIC', how='left')
Unfortunately, there are some rows (18) for which HALTESTELLEN_NAME is missing. Fortunately, we know which stations are affected thanks to the STATION_NAME column that we have just merged in from ds.
# There are 18 missing values for 'HALTESTELLEN_NAME' which we impute from 'STATION_NAME'.
df.loc[df['HALTESTELLEN_NAME'].isna(), "HALTESTELLEN_NAME"] = df.loc[df['HALTESTELLEN_NAME'].isna(), "STATION_NAME"]
Now, we are finally ready to start extracting the network from this data!
Convert it to a network
As I mentioned several times, every row corresponds to a stop of a train at a train station. One train ride from some initial station to some end station (called “Fahrt” in German) then typically consists of several stops along the way. However, there are some “Fahrten” with only one entry. Presumably these are mostly foreign trains that have their final destination at some border station. I decided to remove those entries:
# First group by FAHRT_BEZEICHNER and then filter out all groups with only one entry.
# It's mostly trains that stop at a place at the border (I think).
df_filtered = df.groupby('FAHRT_BEZEICHNER').filter(lambda g: len(g) > 1)

# How many rows do we lose with that?
print(df.shape[0] - df_filtered.shape[0])
420
This preprocessing step removes 420 rows.
Now we group the rows by FAHRT_BEZEICHNER so that each group is one “Fahrt”. Within every group, we sort the stops in ascending order of departure time.
# Function to sort entries within a group in ascending order of ABFAHRTSZEIT
def sort_data(group):
    return group.sort_values('ABFAHRTSZEIT', ascending=True)

# Sort for each group
df_sorted = df_filtered.groupby('FAHRT_BEZEICHNER', group_keys=True).apply(sort_data)
Let’s have a look at one “Fahrt” to get a better idea:
# Look at one example Fahrt
df_sorted.loc[['85:22:1083:000'], ['BETREIBER_NAME', 'LINIEN_TEXT', 'HALTESTELLEN_NAME', 'ABFAHRTSZEIT']]
| FAHRT_BEZEICHNER | | BETREIBER_NAME | LINIEN_TEXT | HALTESTELLEN_NAME | ABFAHRTSZEIT |
|---|---|---|---|---|---|
| 85:22:1083:000 | 64346 | Appenzeller Bahnen (ab) | S23 | Gossau SG | 2025-03-05 08:21:00 |
| | 64347 | Appenzeller Bahnen (ab) | S23 | Herisau | 2025-03-05 08:28:00 |
| | 64348 | Appenzeller Bahnen (ab) | S23 | Herisau Wilen | 2025-03-05 08:30:00 |
| | 64349 | Appenzeller Bahnen (ab) | S23 | Waldstatt | 2025-03-05 08:34:00 |
| | 64350 | Appenzeller Bahnen (ab) | S23 | Zürchersmühle | 2025-03-05 08:39:00 |
| | 64351 | Appenzeller Bahnen (ab) | S23 | Urnäsch | 2025-03-05 08:43:00 |
| | 64352 | Appenzeller Bahnen (ab) | S23 | Jakobsbad | 2025-03-05 08:48:00 |
| | 64353 | Appenzeller Bahnen (ab) | S23 | Gonten | 2025-03-05 08:50:00 |
| | 64354 | Appenzeller Bahnen (ab) | S23 | Gontenbad | 2025-03-05 08:52:00 |
| | 64355 | Appenzeller Bahnen (ab) | S23 | Appenzell | NaT |
This is a train that goes from Gossau to Appenzell with many stops in-between. In Appenzell the ABFAHRTSZEIT is missing as that “Fahrt” ends there (the train will most likely go back in the other direction, but that will be a new “Fahrt”).
We now know enough about the data to extract the edges in a for loop. Basically, we loop over the rows of a given “Fahrt”, starting with the second row, and extract each edge as
(previous station, current station, travel time between stations).
The Python code for this looks as follows:
# Make sure the two time columns are proper datetimes
# (in case they were read in as plain strings)
df_sorted['ANKUNFTSZEIT'] = pd.to_datetime(df_sorted['ANKUNFTSZEIT'])
df_sorted['ABFAHRTSZEIT'] = pd.to_datetime(df_sorted['ABFAHRTSZEIT'])

# Empty list
edgelist = []

# Variables to store previous row and its index
prev_row = None
prev_idx = None

# Loop over rows of dataframe
for i, row in df_sorted.iterrows():
    # Only start with the second row and only if both rows belong to the same Fahrt
    # (i[0] is the FAHRT_BEZEICHNER level of the index)
    if prev_idx is not None and prev_idx == i[0]:
        # Add edge to edgelist assuming it's a directed edge
        edgelist.append((prev_row['STATION_NAME'], row['STATION_NAME'],
                         (row['ANKUNFTSZEIT'] - prev_row['ABFAHRTSZEIT']).total_seconds() / 60))
    # Set current row and row index to previous ones
    prev_idx = i[0]
    prev_row = row
To get a better idea, let’s have a look at the first list element:
# First list element
edgelist[0]
('Zürich HB', 'Basel SBB', 54.0)
We are not quite done yet. The problem is that the edgelist contains many duplicated entries since, for example, the stretch Zürich HB - Basel SBB is served by many different trains on a given day.
What we want to do is to go through all possible edges and sum up the number of times they occur. In addition, we would like to average the travel time between a given pair of stations over all trips between the two stations. The following code does exactly that and saves the result in the form of a dictionary.
# Empty dict
edges = {}

# Loop over elements in edgelist
for i in edgelist:
    # Create key
    key = (i[0], i[1])
    # Get previous entries in dict (if there are any)
    prev = edges.get(key, (0, 0))
    # Update values in dict
    edges[key] = (prev[0] + 1, prev[1] + i[2])

# Divide summed-up travel times by number of trips
edges = {k: (v[0], round(v[1] / v[0], 2)) for k, v in edges.items()}
Let’s look at the entry for the stretch between Zürich and Basel again:
# Look at some element in dict
edges[('Zürich HB', 'Basel SBB')]
(36, 54.0)
There are 36 trips between these two stations (in this direction) and they take 54 minutes on average.
We are now ready to create the final node list (and export it). First, we reduce ds to the train stations that actually appear in the edges (it still contains many bus and tram stops and other things).
# Set of stations that appear in the edgelist
stations_in_edgelist = set(sum(list(edges.keys()), ()))

# Reduce the nodes dataframe to only places in the edgelist
# (taking a copy so that we can modify it below)
nodes = ds[ds['STATION_NAME'].isin(stations_in_edgelist)].copy()
Second, we quickly check the number of missing values again.
# Missing values per column
nodes.isna().sum()
BPUIC 0
STATION_NAME 0
CANTON 21
MUNICIPALITY 21
COMPANY 0
LONGITUDE 0
LATITUDE 0
ELEVATION 1
AVG_DAILY_TRAFFIC 500
AVG_DAILY_TRAFFIC_WEEKDAYS 500
AVG_DAILY_TRAFFIC_WEEKENDS 500
dtype: int64
There are still some issues here. The one we can solve is the missing elevation: the station Tirano (in Italy) has no value for this column. We simply impute it manually (Tirano's elevation is approximately 441 m).
# Impute missing elevation for Tirano
nodes.loc[nodes['STATION_NAME'] == "Tirano", "ELEVATION"] = 441
The missing values for CANTON and MUNICIPALITY concern municipalities abroad (mostly in Germany and Italy). The 500 missing values in the traffic columns concern stations run by smaller companies or stations abroad. There is nothing we can do about these missing values.
# Have a look
nodes.head()

# Export node list
# nodes.sort_values("BPUIC").to_csv("nodelist.csv", sep=';', encoding='utf-8', index=False)
| | BPUIC | STATION_NAME | CANTON | MUNICIPALITY | COMPANY | LONGITUDE | LATITUDE | ELEVATION | AVG_DAILY_TRAFFIC | AVG_DAILY_TRAFFIC_WEEKDAYS | AVG_DAILY_TRAFFIC_WEEKENDS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 12683 | 8500100 | Tavannes | Bern | Tavannes | Swiss Federal Railways SBB | 7.201645 | 47.219845 | 754.17 | 1400.0 | 1600.0 | 810.0 |
| 12684 | 8500121 | Courfaivre | Jura | Haute-Sorne | Swiss Federal Railways SBB | 7.291166 | 47.335083 | 450.99 | 420.0 | 480.0 | 280.0 |
| 12685 | 8500103 | Sorvilier | Bern | Sorvilier | Swiss Federal Railways SBB | 7.305794 | 47.239354 | 681.07 | 60.0 | 70.0 | 49.0 |
| 12688 | 8500120 | Courtételle | Jura | Courtételle | Swiss Federal Railways SBB | 7.317943 | 47.342829 | 436.90 | 840.0 | 970.0 | 550.0 |
| 12689 | 8500102 | Malleray-Bévilard | Bern | Valbirse | Swiss Federal Railways SBB | 7.275946 | 47.238714 | 698.18 | 630.0 | 780.0 | 280.0 |
Before we export the edges, we replace the station names in the edge list with their BPUIC to make the edges more compact. Then we transform the dictionary into a dataframe, which can finally be exported.
# Create a node dict with BPUIC as values
node_dict = dict(zip(nodes.STATION_NAME, nodes.BPUIC))

# Transform edge dict to nested list and replace all station names with their BPUIC
edges = [[node_dict[k[0]], node_dict[k[1]], v[0], v[1]] for k, v in edges.items()]

# Create a dataframe
edges = pd.DataFrame(edges, columns=['BPUIC1', 'BPUIC2', 'NUM_CONNECTIONS', 'AVG_DURATION'])

# Have a look
edges.head()

# Export edge list
# edges.to_csv("edgelist.csv", sep=';', encoding='utf-8', index=False)
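As a teaser for the upcoming parts of this series, here is a minimal sketch (not part of the preprocessing above; it assumes both CSV files were exported as shown) of how the node and edge lists can be loaded into networkx as a directed graph:

import networkx as nx

# Load the exported node and edge lists.
nodes_df = pd.read_csv("nodelist.csv", sep=";")
edges_df = pd.read_csv("edgelist.csv", sep=";")

# Build a directed graph whose edges carry the trip counts and average durations.
G = nx.from_pandas_edgelist(edges_df, source="BPUIC1", target="BPUIC2",
                            edge_attr=["NUM_CONNECTIONS", "AVG_DURATION"],
                            create_using=nx.DiGraph)

# Attach all node attributes (station name, coordinates, traffic, etc.).
nx.set_node_attributes(G, nodes_df.set_index("BPUIC").to_dict("index"))

# Quick sanity check
print(G.number_of_nodes(), "nodes and", G.number_of_edges(), "edges")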