SBB Network Analysis - Part 1

Networks
Author

Martin Sterchi

Published

February 28, 2025

Update 10.03.2025: I updated the analysis in this blog so that it runs on more recent data. More precisely, I use the train traffic data from March 5, 2025 to construct the network. Moreover, I now properly reference the source data and I have added a bunch of additional node attributes. The most interesting new node attributes are average passenger frequency data for all stations.

For quite some time I have been wondering whether there are some interesting Swiss data that could serve as the basis for some fun network analysis. As a fan of public transportation and a long-time owner of a Swiss train pass (“GA”), I should have found the answer much sooner: the Swiss railway network.

I wanted to create a (static) network in which each node corresponds to a train station and each directed edge between any two nodes, A and B, means there is at least one train going nonstop from A to B. Ideally, the edge would also be attributed with some weight representing the importance of the edge (e.g., how many trains go nonstop from A to B on a given day).
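
To make this concrete, here is a purely illustrative sketch of the edge representation we will eventually build (hypothetical stations and numbers, not taken from the actual data):

# Purely illustrative: each directed edge is keyed by a (from_station, to_station) pair
# and carries the number of nonstop trains plus the average travel time in minutes.
toy_edges = {
    ("Station A", "Station B"): (12, 8.5),   # 12 nonstop trains, 8.5 minutes on average
    ("Station B", "Station A"): (11, 9.0),   # the reverse direction is a separate edge
}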

The structure of this post is as follows. I will first introduce the three datasets that I used to create the network. I will then show how to load and preprocess each one of them and how to join them. Finally, I will present how to transform those data into a form that is suitable for network analysis. The following image shows a visualization of the network data resulting from this post.

The Swiss railway network with a geographic layout (created using Gephi).

This is the first part of a series that will cover all kinds of fun network analysis based on the Swiss railway network.

Data sources

It was not obvious how a network with nodes and edges as defined above could be constructed from data of the Swiss Federal Railways (abbreviated as SBB, after the German name). With some help from SBB experts and the Open Data Platform Mobility Switzerland, I finally found the right data.

The first and most important dataset is called Ist-Daten and, for a given day, contains all regular stops of all trains in Switzerland with their planned and effective arrival and departure times. From this data, we can infer all nonstop stretches of any train in Switzerland. A description of this dataset can be found here.

Note that the “Ist-Daten” not only contain the data for trains but also for all other public transport (buses, trams, and even boats). To keep things simple, we will focus on the train network.

The second dataset is the Dienststellen-Daten, which basically allows us to add node attributes such as the geographic coordinates of a node (i.e., a train station). A description of this dataset can be found here.

The third dataset is a statistic of the average number of passengers boarding and alighting at each station. It will allow us to add further interesting node attributes.

Load and preprocess “Ist-Daten”

Here, we will load and preprocess the “Ist-Daten” from which we can derive the edges of our network. First, I import some Python libraries and print their version number for better reproducibility of this code.

import pandas as pd
import numpy as np
from collections import Counter

# Check versions of libraries.
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)

# Make sure there is no limit on the number of columns shown.
pd.set_option('display.max_columns', None)
NumPy version: 1.26.4
Pandas version: 2.1.4

Let’s now load the data. You can see in the filename that I downloaded the “Ist-Daten” from the SBB portal for March 5, 2025. You can get the data for any day you want here.

# Load the data
df = pd.read_csv('2025-03-05_istdaten.csv', sep=";", low_memory=False)

To get a feeling for the data, let’s check the number of rows and columns.

# Number of rows and columns
print(df.shape)
(2510290, 21)

Ok, it’s actually a pretty big dataset: it has over 2.5 million rows. That makes sense as this file contains every stop of every vehicle involved in public transport on a given day. Thus, every row corresponds to a stop of a train, bus, or any other vehicle of public transport.

# Missing values per column
df.isna().sum()
BETRIEBSTAG                  0
FAHRT_BEZEICHNER             0
BETREIBER_ID                 0
BETREIBER_ABK                0
BETREIBER_NAME               0
PRODUKT_ID                  31
LINIEN_ID                    0
LINIEN_TEXT                  0
UMLAUF_ID              1362009
VERKEHRSMITTEL_TEXT          0
ZUSATZFAHRT_TF               0
FAELLT_AUS_TF                0
BPUIC                        0
HALTESTELLEN_NAME       169114
ANKUNFTSZEIT            149415
AN_PROGNOSE             156828
AN_PROGNOSE_STATUS      149200
ABFAHRTSZEIT            149426
AB_PROGNOSE             157273
AB_PROGNOSE_STATUS      149115
DURCHFAHRT_TF                0
dtype: int64

We can see that some columns contain many missing values. The only one I worry about for now is the column PRODUKT_ID. If you look through these rows (I don’t show that here), you can see that they should all be of type “Zug” (train). Thus, we impute accordingly:

# Impute 'Zug'
df.loc[df["PRODUKT_ID"].isna(), "PRODUKT_ID"] = 'Zug'

There are quite a few date and timestamp columns that are not yet in the proper format. Thus, we now convert them to datetime:

# Convert BETRIEBSTAG to date format
df['BETRIEBSTAG'] = pd.to_datetime(df['BETRIEBSTAG'], format = "%d.%m.%Y")

# Convert ANKUNFTSZEIT, AN_PROGNOSE, ABFAHRTSZEIT, AB_PROGNOSE to datetime format
df['ANKUNFTSZEIT'] = pd.to_datetime(df['ANKUNFTSZEIT'], format = "%d.%m.%Y %H:%M")
df['AN_PROGNOSE'] = pd.to_datetime(df['AN_PROGNOSE'], format = "%d.%m.%Y %H:%M:%S")
df['ABFAHRTSZEIT'] = pd.to_datetime(df['ABFAHRTSZEIT'], format = "%d.%m.%Y %H:%M")
df['AB_PROGNOSE'] = pd.to_datetime(df['AB_PROGNOSE'], format = "%d.%m.%Y %H:%M:%S")

Now is a good time to finally have a look at the dataframe:

# Let's look at first few rows
df.head()
BETRIEBSTAG FAHRT_BEZEICHNER BETREIBER_ID BETREIBER_ABK BETREIBER_NAME PRODUKT_ID LINIEN_ID LINIEN_TEXT UMLAUF_ID VERKEHRSMITTEL_TEXT ZUSATZFAHRT_TF FAELLT_AUS_TF BPUIC HALTESTELLEN_NAME ANKUNFTSZEIT AN_PROGNOSE AN_PROGNOSE_STATUS ABFAHRTSZEIT AB_PROGNOSE AB_PROGNOSE_STATUS DURCHFAHRT_TF
0 2025-03-05 80:800631:17230:000 80:800631 DB DB Regio AG Baden-Württemberg Zug 17230 RB NaN RB False False 8500090 Basel Bad Bf 2025-03-05 04:59:00 2025-03-05 04:59:00 PROGNOSE NaT NaT NaN False
1 2025-03-05 80:800631:17233:000 80:800631 DB DB Regio AG Baden-Württemberg Zug 17233 RB NaN RB False False 8500090 Basel Bad Bf NaT NaT NaN 2025-03-05 06:07:00 2025-03-05 06:08:00 PROGNOSE False
2 2025-03-05 80:800631:17234:000 80:800631 DB DB Regio AG Baden-Württemberg Zug 17234 RB NaN RB False False 8500090 Basel Bad Bf 2025-03-05 05:56:00 2025-03-05 06:02:00 PROGNOSE NaT NaT NaN False
3 2025-03-05 80:800631:17235:000 80:800631 DB DB Regio AG Baden-Württemberg Zug 17235 RB NaN RB False False 8500090 Basel Bad Bf NaT NaT NaN 2025-03-05 06:43:00 2025-03-05 06:53:00 PROGNOSE False
4 2025-03-05 80:800631:17236:000 80:800631 DB DB Regio AG Baden-Württemberg Zug 17236 RB NaN RB False False 8500090 Basel Bad Bf 2025-03-05 06:31:00 2025-03-05 06:34:00 PROGNOSE NaT NaT NaN False

But what do all these columns mean? I have browsed the metadata a bit and found the following explanations (which I hopefully reproduce accurately in English):

  • BETRIEBSTAG: Simply the day on which the data were recorded.
  • FAHRT_BEZEICHNER: This is some elaborate identifier in the format [UIC-Countrycode]:[GO-Number]:[VM-Number]:[Extended Reference].
  • BETREIBER_ID: [UIC-Countrycode]:[GO-Number]. GO is short for “Geschäftsorganisation”. For foreign organizations it is not a GO-Number but a TU-Number with TU meaning “Transportunternehmen”. It is basically an ID for the company running that particular train.
  • BETREIBER_ABK: The abbreviation for the company running the train.
  • BETREIBER_NAME: The full name of the company running the train.
  • PRODUKT_ID: Type of public transport.
  • LINIEN_ID: The ID for the route of that train.
  • LINIEN_TEXT: The public ID for the route of that train.
  • UMLAUF_ID: An ID for an “Umlauf”, which describes the period starting with the vehicle leaving the garage and ending with the vehicle being deposited back in the garage.
  • ZUSATZFAHRT_TF: Is true if it is an extraordinary (not usually scheduled) trip.
  • FAELLT_AUS_TF: Is true if the trip is cancelled.
  • BPUIC: The ID of the station.
  • HALTESTELLEN_NAME: The name of the station.
  • ANKUNFTSZEIT: Planned time of arrival at the station.
  • AN_PROGNOSE: Prediction of time of arrival at the station.
  • AN_PROGNOSE_STATUS: Status of that prediction. Possible values are: “UNBEKANNT”, “leer”, “PROGNOSE”, “GESCHAETZT”, “REAL”. If the value of that column is “REAL”, it means that the predicted time of arrival is the time the train actually arrived at the station (a quick tabulation of these values follows after this list).
  • ABFAHRTSZEIT, AB_PROGNOSE, AB_PROGNOSE_STATUS: Same definitions as for arrival but here for departure from the station.
  • DURCHFAHRT_TF: Is true if the vehicle does not stop even if a stop was scheduled.
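
To get a feel for the prediction status columns, one can quickly tabulate the values of AN_PROGNOSE_STATUS; a small sketch (the exact counts depend on the chosen day, so I do not show the output here):

# Quick look at how many arrival predictions are actual measured times ("REAL")
# versus forecasts or estimates; counts will vary by day.
print(df["AN_PROGNOSE_STATUS"].value_counts(dropna=False))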

Let’s now have a look at the values in the column PRODUKT_ID:

# Look at PRODUKT_ID
df["PRODUKT_ID"].value_counts()
PRODUKT_ID
Bus            1965207
Tram            249408
Zug             163649
BUS             124171
Metro             4936
Zahnradbahn       1944
Schiff             975
Name: count, dtype: int64

We can see that trains are only the third most frequent category in this data. However, as mentioned before, we want to keep it simple and now reduce the dataset to only trains.

# First we reduce to only trains
df = df[df['PRODUKT_ID'] == "Zug"]

As a next step, we remove all rows where the corresponding train was cancelled.

# Filter out all entries with FAELLT_AUS_TF == True
df = df[df['FAELLT_AUS_TF'] == False]

Let’s explore the data a bit more before we move to the second dataset. Let’s check out the most frequent values that occur in the column BETREIBER_NAME:

# Look at BETREIBER_NAME
df["BETREIBER_NAME"].value_counts().head()
BETREIBER_NAME
Schweizerische Bundesbahnen SBB    63850
BLS AG (bls)                       16256
THURBO                             13017
Aargau Verkehr AG                   7131
Schweizerische Südostbahn (sob)     6083
Name: count, dtype: int64

As expected, SBB is the operator that accounts for by far the most train stops in the data. What about the column VERKEHRSMITTEL_TEXT?

# Look at VERKEHRSMITTEL_TEXT
df["VERKEHRSMITTEL_TEXT"].value_counts().head()
VERKEHRSMITTEL_TEXT
S     103331
R      34435
RE      9730
IR      7532
IC      3059
Name: count, dtype: int64

We can see that the most frequent train type is the S-Bahn (S). Finally, let’s check the most frequent train stations that occur in the data:

# Look at HALTESTELLEN_NAME
df["HALTESTELLEN_NAME"].value_counts().head()
HALTESTELLEN_NAME
Zürich HB          2239
Bern               1709
Winterthur          955
Zürich Oerlikon     919
Luzern              843
Name: count, dtype: int64

Unsurprisingly, Zürich HB and Bern are the stations occurring most frequently in the data.

Load and preprocess “Dienststellen-Daten”

Fortunately, we can go through the second dataset a bit more quickly. We again start by loading it and checking the dimensions of the dataframe.

# Load the data
ds = pd.read_csv('actual_date-swiss-only-service_point-2025-03-06.csv', sep = ";", low_memory = False)

# Number of rows and columns
print(ds.shape)
(55308, 55)

The data contains a column validTo that allows us to filter out all stations that are not valid anymore (closed down?). We check the values that appear in this column and see that all stations should be valid as of March 6, 2025. This is no surprise as we use the dataset of currently valid stations.

# Check 'validTo' values.
ds['validTo'].unique()
array(['9999-12-31', '2025-12-13', '2025-04-06', '2026-12-12',
       '2025-08-29', '2027-12-11', '2025-03-16', '2025-05-30',
       '2099-12-31', '2028-12-09', '2025-09-30', '2030-12-14',
       '2025-06-30', '2050-12-31', '2029-12-09', '2025-08-31',
       '2025-03-31', '2025-04-12', '2025-05-16', '2025-03-06'],
      dtype=object)
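
Strictly speaking, no filtering is needed here since every validTo date lies on or after the snapshot date. If a snapshot did contain expired entries, a filter along the following lines could be used (a minimal sketch, assuming the snapshot date of March 6, 2025):

# Optional: keep only service points still valid on the snapshot date.
# ISO date strings compare correctly as plain strings, which avoids the fact that
# '9999-12-31' lies outside the range pandas can represent as a Timestamp.
ds = ds[ds['validTo'] >= '2025-03-06']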

Let’s also quickly make sure that we have unique rows (based on ‘number’).

# Is the number of unique 'number' (= BPUIC) values equal to the number of rows?
len(pd.unique(ds['number'])) == ds.shape[0]
True

Finally, we keep only the columns we need (identifier, official name, canton, municipality, operating company, geographic coordinates, and elevation).

# Keep only the relevant columns
ds = ds[["number","designationOfficial","cantonName","municipalityName","businessOrganisationDescriptionEn","wgs84East","wgs84North","height"]]

# Show first few rows
ds.head()
number designationOfficial cantonName municipalityName businessOrganisationDescriptionEn wgs84East wgs84North height
0 1322001 Antronapiana NaN NaN Autoservizi Comazzi S.R.L. 8.113620 46.060120 0.0
1 1322002 Anzola d'Ossola NaN NaN Autoservizi Comazzi S.R.L. 8.345715 45.989869 0.0
2 1322003 Baceno NaN NaN Autoservizi Comazzi S.R.L. 8.319256 46.261501 0.0
3 1322012 Castiglione NaN NaN Autoservizi Comazzi S.R.L. 8.214886 46.020588 0.0
4 1322013 Ceppo Morelli NaN NaN Autoservizi Comazzi S.R.L. 8.069922 45.971036 0.0

Load and preprocess average traffic data

This part is also fairly easy. We load the data and check the dimensions, as always.

# Load the data
ds_freq = pd.read_csv('t01x-sbb-cff-ffs-frequentia-2023.csv', sep = ";", low_memory = False)

# Number of rows and columns
print(ds_freq.shape)
(3479, 14)

If you actually have a look at the data, you see that many stations have several measurements made in different years (the measurement years are identified by the column Jahr_Annee_Anno). We only want to keep the most recent measurement for every station:

# For every station, we only keep the most recent measurements.
ds_freq = ds_freq.loc[ds_freq.groupby('UIC')['Jahr_Annee_Anno'].idxmax()]

Checking the data types of all columns reveals that there is still a problem with the measurement columns DTV_TJM_TGM, DWV_TMJO_TFM, and DNWV_TMJNO_TMGNL. They are currently of type object because they contain the thousands separator ’. We thus remove all instances of this character and convert these columns to integers.

# Data types of columns
ds_freq.dtypes

# Remove thousand separator and make integers out of it.
ds_freq['DTV_TJM_TGM'] = ds_freq['DTV_TJM_TGM'].str.replace('’', '').astype(int)
ds_freq['DWV_TMJO_TFM'] = ds_freq['DWV_TMJO_TFM'].str.replace('’', '').astype(int)
ds_freq['DNWV_TMJNO_TMGNL'] = ds_freq['DNWV_TMJNO_TMGNL'].str.replace('’', '').astype(int)

Finally, we keep only the relevant columns.

# Keep only the relevant columns
ds_freq = ds_freq[["UIC","DTV_TJM_TGM","DWV_TMJO_TFM","DNWV_TMJNO_TMGNL"]]
# Show first few rows
ds_freq.head()
UIC DTV_TJM_TGM DWV_TMJO_TFM DNWV_TMJNO_TMGNL
411 8500010 98600 105900 81900
423 8500016 90 100 60
2024 8500020 5700 7000 2800
2294 8500021 8500 9900 5200
1072 8500022 3600 4100 2300

But what exactly are these three measurement variables? The source dataset provides the following definitions:

  • DTV_TJM_TGM: “Average daily traffic (Monday to Sunday).”
  • DWV_TMJO_TFM: “Average traffic on weekdays (Monday to Friday).”
  • DNWV_TMJNO_TMGNL: “Average non-work day traffic (Saturdays, Sundays and public holidays).”

It is further mentioned that all passengers boarding and exiting the trains are counted. That also means that passengers who switch trains are counted twice. For larger stations, the data may not cover all trains arriving and departing at the corresponding station. For example, the numbers for Bern do not include the traffic generated by the regional train company RBS.

Combine the three datasets

We first merge the traffic data to the “Dienststellen-Daten”:

# Join to 'ds'
ds = pd.merge(ds, ds_freq, left_on = 'number', right_on = 'UIC', how = 'left')

# Drop 'UIC'
ds = ds.drop('UIC', axis=1)

# Better column names
ds.columns = ['BPUIC','STATION_NAME','CANTON','MUNICIPALITY','COMPANY',
              'LONGITUDE','LATITUDE','ELEVATION','AVG_DAILY_TRAFFIC',
              'AVG_DAILY_TRAFFIC_WEEKDAYS','AVG_DAILY_TRAFFIC_WEEKENDS']

Then we merge the “Dienststellen-Daten” to the “Ist-Daten” via the BPUIC variable:

# Left-join with station names and coordinates
df = pd.merge(df, ds, on = 'BPUIC', how = 'left')

Unfortunately, there are some rows (18) for which HALTESTELLEN_NAME is missing. But fortunately, we know which stations are affected based on the STATION_NAME column that we have just merged from ds.

# There are 18 missing values for 'HALTESTELLEN_NAME' which we impute from 'STATION_NAME'.
df.loc[df['HALTESTELLEN_NAME'].isna(), "HALTESTELLEN_NAME"] = df.loc[df['HALTESTELLEN_NAME'].isna(), "STATION_NAME"]

Now, we are finally ready to start extracting the network from this data!

Convert it to a network

As I mentioned several times, every row corresponds to a stop of a train at a train station. One train ride from some initial station to some end station (called “Fahrt” in German) then typically consists of several stops along the way. However, there are some “Fahrten” with only one entry. Presumably these are mostly foreign trains that have their final destination at some border station. I decided to remove those entries:

# First group by FAHRT_BEZEICHNER and then filter out all groups with only one entry
# It's mostly trains that stop at a place at the border (I think)
df_filtered = df.groupby('FAHRT_BEZEICHNER').filter(lambda g: len(g) > 1)

# How many rows do we lose with that?
print(df.shape[0] - df_filtered.shape[0])
420

This preprocessing step removes 420 rows.

Now we group the rows by FAHRT_BEZEICHNER so that each group corresponds to one “Fahrt”. Within every group, we sort the stops in ascending order of departure time.

# Function to sort entries within a group in ascending order of ABFAHRTSZEIT
def sort_data(group):
    return group.sort_values('ABFAHRTSZEIT', ascending = True)

# Sort for each group
df_sorted = df_filtered.groupby('FAHRT_BEZEICHNER', group_keys=True).apply(sort_data)

Let’s have a look at one “Fahrt” to get a better idea:

# Look at one example Fahrt
df_sorted.loc[['85:22:1083:000'],['BETREIBER_NAME','LINIEN_TEXT','HALTESTELLEN_NAME','ABFAHRTSZEIT']]
BETREIBER_NAME LINIEN_TEXT HALTESTELLEN_NAME ABFAHRTSZEIT
FAHRT_BEZEICHNER
85:22:1083:000 64346 Appenzeller Bahnen (ab) S23 Gossau SG 2025-03-05 08:21:00
64347 Appenzeller Bahnen (ab) S23 Herisau 2025-03-05 08:28:00
64348 Appenzeller Bahnen (ab) S23 Herisau Wilen 2025-03-05 08:30:00
64349 Appenzeller Bahnen (ab) S23 Waldstatt 2025-03-05 08:34:00
64350 Appenzeller Bahnen (ab) S23 Zürchersmühle 2025-03-05 08:39:00
64351 Appenzeller Bahnen (ab) S23 Urnäsch 2025-03-05 08:43:00
64352 Appenzeller Bahnen (ab) S23 Jakobsbad 2025-03-05 08:48:00
64353 Appenzeller Bahnen (ab) S23 Gonten 2025-03-05 08:50:00
64354 Appenzeller Bahnen (ab) S23 Gontenbad 2025-03-05 08:52:00
64355 Appenzeller Bahnen (ab) S23 Appenzell NaT

This is a train that goes from Gossau to Appenzell with many stops in-between. In Appenzell the ABFAHRTSZEIT is missing as that “Fahrt” ends there (the train will most likely go back in the other direction, but that will be a new “Fahrt”).

We now know enough about the data to extract the edges in a for loop. Basically, we loop over the rows of a given “Fahrt”, starting with the second row, and extract edges of the form

(previous station, current station, travel time between stations).

The Python code for this looks as follows:

# Empty list
edgelist = []

# Variables to store previous row and its index
prev_row = None
prev_idx = None

# Loop over rows of dataframe
for i, row in df_sorted.iterrows():
    # Only start with second row
    # Only if the two rows belong to the same Fahrt
    if prev_idx is not None and prev_idx == i[0]:
        # Add edge to edgelist assuming it's a directed edge
        edgelist.append((prev_row['STATION_NAME'], 
                         row['STATION_NAME'], 
                         (row['ANKUNFTSZEIT'] - prev_row['ABFAHRTSZEIT']).total_seconds() / 60))
    # Remember the current row and its Fahrt index for the next iteration
    prev_idx = i[0]
    prev_row = row

To get a better idea, let’s have a look at the first list element:

# First list element
edgelist[0]
('Zürich HB', 'Basel SBB', 54.0)

We are not quite done yet. The problem is that the edgelist contains many duplicate entries because, for example, the stretch Zürich HB - Basel SBB is served by many different trains on a given day.

What we want to do is go through all extracted edges, count how many times each one occurs, and average the travel time over all trips between the corresponding pair of stations. The following code does exactly that and saves the result in the form of a dictionary.

# Empty dict
edges = {}

# Loop over elements in edgelist
for i in edgelist:
    # Create key
    key = (i[0], i[1])
    # Get previous entries in dict (if there are any)
    prev = edges.get(key, (0, 0))
    # Update values in dict
    edges[key] = (prev[0] + 1, prev[1] + i[2])

# Divide summed up travel times by number of trips
edges = {k: (v[0], round(v[1]/v[0], 2)) for k, v in edges.items()}

Let’s look at the entry for the stretch between Zürich and Basel again:

# Look at some element in dict
edges[('Zürich HB', 'Basel SBB')]
(36, 54.0)

There are 36 trips between these two stations (in this direction) and they take 54 minutes on average.
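
As a side note, the trip counts alone could also be obtained with the Counter imported at the beginning; a small sketch (it only reproduces the counts, not the average durations):

# Alternative for the counts only: Counter tallies how often each (from, to) pair occurs.
edge_counts = Counter((a, b) for a, b, _ in edgelist)
print(edge_counts[('Zürich HB', 'Basel SBB')])   # 36, matching the count in the dict above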

We are now ready to create the final node list (and export it). First, we reduce ds to the train stations that actually appear in the edges (it still contains many bus and tram stops and other things).

# Set of stations that appear in edgelist
stations_in_edgelist = set(sum(list(edges.keys()), ()))

# Reduce the nodes dataframe to only the stations in the edgelist (copy to allow later edits)
nodes = ds[ds['STATION_NAME'].isin(stations_in_edgelist)].copy()

Second, we quickly check the number of missing values again.

# Missing values per column
nodes.isna().sum()
BPUIC                           0
STATION_NAME                    0
CANTON                         21
MUNICIPALITY                   21
COMPANY                         0
LONGITUDE                       0
LATITUDE                        0
ELEVATION                       1
AVG_DAILY_TRAFFIC             500
AVG_DAILY_TRAFFIC_WEEKDAYS    500
AVG_DAILY_TRAFFIC_WEEKENDS    500
dtype: int64

There are still some issues here. The one we can solve is the missing elevation. The station Tirano (in Italy) has no value for this column. We simply impute manually (Tirano’s elevation is approximately 441m).

# Impute missing elevation for Tirano
nodes.loc[nodes['STATION_NAME'] == "Tirano", "ELEVATION"] = 441

The missing values for CANTON and MUNICIPALITY concern municipalities abroad (mostly in Germany and Italy). The 500 missing values in the traffic columns concern stations run by smaller companies or stations abroad. There is nothing we can do about all these missing values.

# Have a look
nodes.head()

# Export node list
# nodes.sort_values("BPUIC").to_csv("nodelist.csv", sep = ';', encoding = 'utf-8', index = False)
BPUIC STATION_NAME CANTON MUNICIPALITY COMPANY LONGITUDE LATITUDE ELEVATION AVG_DAILY_TRAFFIC AVG_DAILY_TRAFFIC_WEEKDAYS AVG_DAILY_TRAFFIC_WEEKENDS
12683 8500100 Tavannes Bern Tavannes Swiss Federal Railways SBB 7.201645 47.219845 754.17 1400.0 1600.0 810.0
12684 8500121 Courfaivre Jura Haute-Sorne Swiss Federal Railways SBB 7.291166 47.335083 450.99 420.0 480.0 280.0
12685 8500103 Sorvilier Bern Sorvilier Swiss Federal Railways SBB 7.305794 47.239354 681.07 60.0 70.0 49.0
12688 8500120 Courtételle Jura Courtételle Swiss Federal Railways SBB 7.317943 47.342829 436.90 840.0 970.0 550.0
12689 8500102 Malleray-Bévilard Bern Valbirse Swiss Federal Railways SBB 7.275946 47.238714 698.18 630.0 780.0 280.0

Before we export the edges, we replace the station names in the edgelist with their BPUIC to make the edges more compact. Then we transform the dictionary into a dataframe, which can finally be exported.

# Create a node dict with BPUIC as values
node_dict = dict(zip(nodes.STATION_NAME, nodes.BPUIC))

# Transform edge dict to nested list and replace all station names with their BPUIC
edges = [[node_dict[k[0]], node_dict[k[1]], v[0], v[1]] for k,v in edges.items()]

# Create a dataframe
edges = pd.DataFrame(edges, columns = ['BPUIC1','BPUIC2','NUM_CONNECTIONS','AVG_DURATION'])

# Have a look
edges.head()

# Export edge list
# edges.to_csv("edgelist.csv", sep = ';', encoding = 'utf-8', index = False)
BPUIC1 BPUIC2 NUM_CONNECTIONS AVG_DURATION
0 8503000 8500010 36 54.00
1 8500010 8500090 67 6.07
2 8500090 8500010 67 6.39
3 8500010 8503000 39 57.87
4 8503424 8500090 17 73.18

Feel free to download the final results: Nodelist (CSV) and Edgelist (CSV).
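
To close, here is a minimal sketch of how the two exported files could be loaded into a directed graph for the analyses in the upcoming parts. This assumes the networkx library is installed and that the two CSV files are in the working directory; it is just one of several ways to build the graph.

import pandas as pd
import networkx as nx

# Load the exported node and edge lists (semicolon-separated, as exported above).
nodes = pd.read_csv("nodelist.csv", sep=";")
edges = pd.read_csv("edgelist.csv", sep=";")

# Build a directed graph with the trip counts and average durations as edge attributes.
G = nx.from_pandas_edgelist(edges, source="BPUIC1", target="BPUIC2",
                            edge_attr=["NUM_CONNECTIONS", "AVG_DURATION"],
                            create_using=nx.DiGraph)

# Attach the node attributes (station name, coordinates, traffic figures, ...).
nx.set_node_attributes(G, nodes.set_index("BPUIC").to_dict(orient="index"))

print(G.number_of_nodes(), G.number_of_edges())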

The title image was created by Wikimedia user JoachimKohler-HB and is licensed under a Creative Commons license.