Update 10.03.2025: I updated the analysis in this blog post so that it runs on more recent data. More precisely, I use the train traffic data from March 5, 2025 to construct the network. Moreover, I now properly reference the source data and have added a number of additional node attributes. The most interesting new node attributes are the average passenger frequencies for all stations.
For quite some time I have been wondering if there are some interesting Swiss data that would serve as the basis for some fun network analysis. As a fan of public transportation and a long-time owner of a Swiss train pass (“GA”), the answer should have been obvious much sooner: the Swiss railway network.
I wanted to create a (static) network in which each node corresponds to a train station and each directed edge between any two nodes, A and B, means there is at least one train going nonstop from A to B. Ideally, the edge would also be attributed with some weight representing the importance of the edge (e.g., how many trains go nonstop from A to B on a given day).
The structure of this post is as follows. I will first introduce the three datasets that I used to create the network. I will then show how to load and preprocess each one of them and how to join them. Finally, I will present how to transform those data into a form that is suitable for network analysis. The following image shows a visualization of the network data resulting from this post.
The Swiss railway network with a geographic layout (created using Gephi).
This is the first part of a series that will cover all kinds of fun network analysis based on the Swiss railway network.
Data sources
It was not obvious how a network with nodes and edges following the definitions given above could be constructed from data of the Swiss Federal Railways (abbreviated by the German speakers in Switzerland as SBB). With some help from SBB experts and the Open Data Platform Mobility Switzerland, I finally found the right data.
The first and most important dataset is called Ist-Daten and, for a given day, contains all regular stops of all trains in Switzerland with their planned and effective arrival and departure times. From this data, we can infer all nonstop stretches of any train in Switzerland. A description of this dataset can be found here.
Note that the “Ist-Daten” not only contain the data for trains but also for all other public transport (buses, trams, and even boats). To keep things simple we will focus on the train network.
The second dataset is the Dienststellen-Daten, which basically allows us to add node attributes such as the geographic coordinates of a node (i.e., a train station). A description of this dataset can be found here. The third dataset contains average passenger frequencies for the stations, which we will use as additional node attributes; it is introduced in more detail below.
Load and preprocess “Ist-Daten”
Here, we will load and preprocess the “Ist-Daten” from which we can derive the edges of our network. First, I import some Python libraries and print their version numbers for better reproducibility of this code.
import pandas as pd
import numpy as np
from collections import Counter

# Check versions of libraries.
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)

# Make sure there is no limit on the number of columns shown.
pd.set_option('display.max_columns', None)
NumPy version: 1.26.4
Pandas version: 2.1.4
Let’s now load the data. You can see in the filename that I downloaded the “Ist-Daten” from the SBB portal for March 5, 2025. You can get the data for any day you want here.
# Load the data
df = pd.read_csv('2025-03-05_istdaten.csv', sep=";", low_memory=False)
To get a feeling for the data, let’s check the number of rows and columns.
# Number of rows and columns
print(df.shape)
(2510290, 21)
Ok, it’s actually a pretty big dataset: it has over 2.5 million rows. That makes sense as this file contains every stop of every vehicle involved in public transport on a given day. Thus, every row corresponds to a stop of a train, bus, or any other vehicle of public transport.
Checking the missing values per column (e.g., with df.isna().sum()) shows that some columns contain many of them. The only one I worry about for now is the column PRODUKT_ID. If you look through the affected rows (I don't show that here), you can see that they should all be of type “Zug” (train). Thus, we impute accordingly:
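# Fill the missing PRODUKT_ID values with "Zug" (they are all trains).
df['PRODUKT_ID'] = df['PRODUKT_ID'].fillna('Zug')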
Now is a good time to finally have a look at the dataframe:
# Let's look at first few rows
df.head()
| | BETRIEBSTAG | FAHRT_BEZEICHNER | BETREIBER_ID | BETREIBER_ABK | BETREIBER_NAME | PRODUKT_ID | LINIEN_ID | LINIEN_TEXT | UMLAUF_ID | VERKEHRSMITTEL_TEXT | ZUSATZFAHRT_TF | FAELLT_AUS_TF | BPUIC | HALTESTELLEN_NAME | ANKUNFTSZEIT | AN_PROGNOSE | AN_PROGNOSE_STATUS | ABFAHRTSZEIT | AB_PROGNOSE | AB_PROGNOSE_STATUS | DURCHFAHRT_TF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-03-05 | 80:800631:17230:000 | 80:800631 | DB | DB Regio AG Baden-Württemberg | Zug | 17230 | RB | NaN | RB | False | False | 8500090 | Basel Bad Bf | 2025-03-05 04:59:00 | 2025-03-05 04:59:00 | PROGNOSE | NaT | NaT | NaN | False |
| 1 | 2025-03-05 | 80:800631:17233:000 | 80:800631 | DB | DB Regio AG Baden-Württemberg | Zug | 17233 | RB | NaN | RB | False | False | 8500090 | Basel Bad Bf | NaT | NaT | NaN | 2025-03-05 06:07:00 | 2025-03-05 06:08:00 | PROGNOSE | False |
| 2 | 2025-03-05 | 80:800631:17234:000 | 80:800631 | DB | DB Regio AG Baden-Württemberg | Zug | 17234 | RB | NaN | RB | False | False | 8500090 | Basel Bad Bf | 2025-03-05 05:56:00 | 2025-03-05 06:02:00 | PROGNOSE | NaT | NaT | NaN | False |
| 3 | 2025-03-05 | 80:800631:17235:000 | 80:800631 | DB | DB Regio AG Baden-Württemberg | Zug | 17235 | RB | NaN | RB | False | False | 8500090 | Basel Bad Bf | NaT | NaT | NaN | 2025-03-05 06:43:00 | 2025-03-05 06:53:00 | PROGNOSE | False |
| 4 | 2025-03-05 | 80:800631:17236:000 | 80:800631 | DB | DB Regio AG Baden-Württemberg | Zug | 17236 | RB | NaN | RB | False | False | 8500090 | Basel Bad Bf | 2025-03-05 06:31:00 | 2025-03-05 06:34:00 | PROGNOSE | NaT | NaT | NaN | False |
But what do all these columns mean? I browsed the metadata a bit and found the following explanations (which I hopefully reproduce accurately in English):
BETRIEBSTAG: Simply the day on which the data were recorded.
FAHRT_BEZEICHNER: An elaborate identifier in the format [UIC-Countrycode]:[GO-Number]:[VM-Number]:[Extended Reference] (see the short parsing example right after this list).
BETREIBER_ID: [UIC-Countrycode]:[GO-Number]. GO is short for “Geschäftsorganisation” (business organization). For foreign organizations it is not a GO-Number but a TU-Number, with TU meaning “Transportunternehmen” (transport company). It is basically an ID for the company running that particular train.
BETREIBER_ABK: The abbreviation for the company running the train.
BETREIBER_NAME: The full name of the company running the train.
PRODUKT_ID: Type of public transport.
LINIEN_ID: The ID for the route of that train.
LINIEN_TEXT: The public ID for the route of that train.
UMLAUF_ID: An ID for an “Umlauf”, which describes the period starting with the vehicle leaving the depot and ending with the vehicle being deposited back in the depot.
VERKEHRSMITTEL_TEXT: Not covered by my metadata notes, but visible in the data: presumably the category of the means of transport (e.g., “RB” for Regionalbahn in the rows shown above).
ZUSATZFAHRT_TF: Is true if it is an extraordinary (not usually scheduled) trip.
FAELLT_AUS_TF: Is true if the trip is cancelled.
BPUIC: The ID of the station.
HALTESTELLEN_NAME: The name of the station.
ANKUNFTSZEIT: Planned time of arrival at the station.
AN_PROGNOSE: Prediction of time of arrival at the station.
AN_PROGNOSE_STATUS: Status of that prediction. Possible values are: “UNBEKANNT”, “leer”, “PROGNOSE”, “GESCHAETZT”, “REAL”. If the value of that column is “REAL”, it means that the predicted time of arrival is the time the train actually arrived at the station.
ABFAHRTSZEIT, AB_PROGNOSE, AB_PROGNOSE_STATUS: Same definitions as for arrival but here for departure from the station.
DURCHFAHRT_TF: Is true if the vehicle does not stop even if a stop was scheduled.
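To make the FAHRT_BEZEICHNER format concrete, here is a small sketch (using an identifier value that appears later in this post) that splits it into its four components:

# Split a FAHRT_BEZEICHNER into its components:
# [UIC-Countrycode]:[GO-Number]:[VM-Number]:[Extended Reference]
country_code, go_number, vm_number, extended_ref = "85:22:1083:000".split(":")
print(country_code, go_number, vm_number, extended_ref)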
Let’s now have a look at the values in the column PRODUKT_ID:
# Look at PRODUKT_ID
df["PRODUKT_ID"].value_counts()
PRODUKT_ID
Bus 1965207
Tram 249408
Zug 163649
BUS 124171
Metro 4936
Zahnradbahn 1944
Schiff 975
Name: count, dtype: int64
We can see that trains are only the third most frequent category in this data. However, as mentioned before, we want to keep it simple and now reduce the dataset to only trains.
# First we reduce to only trains
df = df[df['PRODUKT_ID'] == "Zug"]
In the next step, we remove all rows where the corresponding train was cancelled.
# Filter out all entries with FAELLT_AUS_TF == True
df = df[df['FAELLT_AUS_TF'] == False]
Let’s explore the data a bit more before we move to the second dataset. Let’s check out the most frequent values that occur in the column HALTESTELLEN_NAME:

# Look at HALTESTELLEN_NAME
df["HALTESTELLEN_NAME"].value_counts().head()

Unsurprisingly, Zürich HB and Bern are the most frequent values occurring in the data.
Load and preprocess “Dienststellen-Daten”
Fortunately, we can go through the second dataset a bit more quickly. We again start by loading it and checking the dimensions of the dataframe.
# Load the data
ds = pd.read_csv('actual_date-swiss-only-service_point-2025-03-06.csv', sep=";", low_memory=False)

# Number of rows and columns
print(ds.shape)
(55308, 55)
The data contains a column validTo that allows us to filter out all stations that are not valid anymore (closed down?). We check the values that appear in this column and see that all stations should be valid as of March 6, 2025. This is no surprise as we use the dataset of currently valid stations.
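A quick way to verify this (a minimal sketch, assuming the validTo column parses as a date):

# The earliest 'validTo' date lies in the future, so all stations are currently valid.
pd.to_datetime(ds['validTo']).min()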
Let’s also quickly make sure that we have unique rows (based on ‘number’).
# Is the number of unique 'number' (= BPUIC) values equal to the number of rows?
len(pd.unique(ds['number'])) == ds.shape[0]
True
Finally, we keep only the columns we need (identifier, official name, canton, municipality, operating company, geographic coordinates, and elevation).
# Keep only the relevant columns
ds = ds[["number", "designationOfficial", "cantonName", "municipalityName",
         "businessOrganisationDescriptionEn", "wgs84East", "wgs84North", "height"]]

# Show first few rows
ds.head()
| | number | designationOfficial | cantonName | municipalityName | businessOrganisationDescriptionEn | wgs84East | wgs84North | height |
|---|---|---|---|---|---|---|---|---|
| 0 | 1322001 | Antronapiana | NaN | NaN | Autoservizi Comazzi S.R.L. | 8.113620 | 46.060120 | 0.0 |
| 1 | 1322002 | Anzola d'Ossola | NaN | NaN | Autoservizi Comazzi S.R.L. | 8.345715 | 45.989869 | 0.0 |
| 2 | 1322003 | Baceno | NaN | NaN | Autoservizi Comazzi S.R.L. | 8.319256 | 46.261501 | 0.0 |
| 3 | 1322012 | Castiglione | NaN | NaN | Autoservizi Comazzi S.R.L. | 8.214886 | 46.020588 | 0.0 |
| 4 | 1322013 | Ceppo Morelli | NaN | NaN | Autoservizi Comazzi S.R.L. | 8.069922 | 45.971036 | 0.0 |
Load and preprocess average traffic data
This part is also fairly easy. We load the data and check the dimensions, as always.
# Load the data
ds_freq = pd.read_csv('t01x-sbb-cff-ffs-frequentia-2023.csv', sep=";", low_memory=False)

# Number of rows and columns
print(ds_freq.shape)
(3479, 14)
If you actually have a look at the data, you see that many stations have several measurements taken at different times (the year of each measurement is stored in the column Jahr_Annee_Anno). We only want to keep the most recent measurement for every station:
# For every station, we only keep the most recent measurements.
ds_freq = ds_freq.loc[ds_freq.groupby('UIC')['Jahr_Annee_Anno'].idxmax()]
Checking the data types of all columns reveals that there is still a problem with the measurement columns DTV_TJM_TGM, DWV_TMJO_TFM, and DNWV_TMJNO_TMGNL. They are currently of type object because they contain the thousands separator ’. We thus remove all instances of this character and convert these columns to integers.
# Data types of columns
ds_freq.dtypes

# Remove thousand separator and make integers out of it.
ds_freq['DTV_TJM_TGM'] = ds_freq['DTV_TJM_TGM'].str.replace('’', '').astype(int)
ds_freq['DWV_TMJO_TFM'] = ds_freq['DWV_TMJO_TFM'].str.replace('’', '').astype(int)
ds_freq['DNWV_TMJNO_TMGNL'] = ds_freq['DNWV_TMJNO_TMGNL'].str.replace('’', '').astype(int)
Finally, we keep only the relevant columns.
# Keep only the relevant columns
ds_freq = ds_freq[["UIC", "DTV_TJM_TGM", "DWV_TMJO_TFM", "DNWV_TMJNO_TMGNL"]]

# Show first few rows
ds_freq.head()
| | UIC | DTV_TJM_TGM | DWV_TMJO_TFM | DNWV_TMJNO_TMGNL |
|---|---|---|---|---|
| 411 | 8500010 | 98600 | 105900 | 81900 |
| 423 | 8500016 | 90 | 100 | 60 |
| 2024 | 8500020 | 5700 | 7000 | 2800 |
| 2294 | 8500021 | 8500 | 9900 | 5200 |
| 1072 | 8500022 | 3600 | 4100 | 2300 |
But what exactly are these three measurement variables? The source dataset provides the following definitions:
DTV_TJM_TGM: “Average daily traffic (Monday to Sunday).”
DWV_TMJO_TFM: “Average traffic on weekdays (Monday to Friday).”
DNWV_TMJNO_TMGNL: “Average non-work day traffic (Saturdays, Sundays and public holidays).”
It is further mentioned that all passengers boarding and exiting the trains are counted. That also means that passengers who switch trains are counted twice. For larger stations, the data may not cover all trains arriving and departing at the corresponding station. For example, the numbers for Bern do not include the traffic generated by the regional train company RBS.
Combine the three datasets
We first merge the traffic data to the “Dienststellen-Daten”:
# Join to 'ds'
ds = pd.merge(ds, ds_freq, left_on='number', right_on='UIC', how='left')

# Drop 'UIC'
ds = ds.drop('UIC', axis=1)

# Better column names
ds.columns = ['BPUIC', 'STATION_NAME', 'CANTON', 'MUNICIPALITY', 'COMPANY',
              'LONGITUDE', 'LATITUDE', 'ELEVATION', 'AVG_DAILY_TRAFFIC',
              'AVG_DAILY_TRAFFIC_WEEKDAYS', 'AVG_DAILY_TRAFFIC_WEEKENDS']
Then we merge the “Dienststellen-Daten” to the “Ist-Daten” via the BPUIC variable:
# Left-join with station names and coordinates
df = pd.merge(df, ds, on='BPUIC', how='left')
Unfortunately, there are some rows (18) for which HALTESTELLEN_NAME is missing. Fortunately, we know which stations are affected thanks to the STATION_NAME column that we have just merged in from ds.
# There are 18 missing values for 'HALTESTELLEN_NAME' which we impute from 'STATION_NAME'.
df.loc[df['HALTESTELLEN_NAME'].isna(), "HALTESTELLEN_NAME"] = df.loc[df['HALTESTELLEN_NAME'].isna(), "STATION_NAME"]
Now, we are finally ready to start extracting the network from this data!
Convert it to a network
As I mentioned several times, every row corresponds to a stop of a train at a train station. One train ride from some initial station to some end station (called “Fahrt” in German) then typically consists of several stops along the way. However, there are some “Fahrten” with only one entry. Presumably these are mostly foreign trains that have their final destination at some border station. I decided to remove those entries:
# First group by FAHRT_BEZEICHNER and then filter out all groups with only one entry.
# It's mostly trains that stop at a place at the border (I think).
df_filtered = df.groupby('FAHRT_BEZEICHNER').filter(lambda g: len(g) > 1)

# How many rows do we lose with that?
print(df.shape[0] - df_filtered.shape[0])
420
This preprocessing step removes 420 rows.
Now we group the rows by FAHRT_BEZEICHNER so that each group is one “Fahrt”. Within every group, we sort the stops in ascending order of departure time.
# Function to sort entries within a group in ascending order of ABFAHRTSZEIT
def sort_data(group):
    return group.sort_values('ABFAHRTSZEIT', ascending=True)

# Sort for each group
df_sorted = df_filtered.groupby('FAHRT_BEZEICHNER', group_keys=True).apply(sort_data)
Let’s have a look at one “Fahrt” to get a better idea:
# Look at one example Fahrt
df_sorted.loc[['85:22:1083:000'], ['BETREIBER_NAME', 'LINIEN_TEXT', 'HALTESTELLEN_NAME', 'ABFAHRTSZEIT']]
| FAHRT_BEZEICHNER | | BETREIBER_NAME | LINIEN_TEXT | HALTESTELLEN_NAME | ABFAHRTSZEIT |
|---|---|---|---|---|---|
| 85:22:1083:000 | 64346 | Appenzeller Bahnen (ab) | S23 | Gossau SG | 2025-03-05 08:21:00 |
| | 64347 | Appenzeller Bahnen (ab) | S23 | Herisau | 2025-03-05 08:28:00 |
| | 64348 | Appenzeller Bahnen (ab) | S23 | Herisau Wilen | 2025-03-05 08:30:00 |
| | 64349 | Appenzeller Bahnen (ab) | S23 | Waldstatt | 2025-03-05 08:34:00 |
| | 64350 | Appenzeller Bahnen (ab) | S23 | Zürchersmühle | 2025-03-05 08:39:00 |
| | 64351 | Appenzeller Bahnen (ab) | S23 | Urnäsch | 2025-03-05 08:43:00 |
| | 64352 | Appenzeller Bahnen (ab) | S23 | Jakobsbad | 2025-03-05 08:48:00 |
| | 64353 | Appenzeller Bahnen (ab) | S23 | Gonten | 2025-03-05 08:50:00 |
| | 64354 | Appenzeller Bahnen (ab) | S23 | Gontenbad | 2025-03-05 08:52:00 |
| | 64355 | Appenzeller Bahnen (ab) | S23 | Appenzell | NaT |
This is a train that goes from Gossau to Appenzell with many stops in-between. In Appenzell the ABFAHRTSZEIT is missing as that “Fahrt” ends there (the train will most likely go back in the other direction, but that will be a new “Fahrt”).
We now know enough about the data to extract the edges in a for loop. Basically, we loop over the rows of a given “Fahrt”, starting with the second row, and extract each edge as
(previous station, current station, travel time between stations).
The Python code for this looks as follows:
# Make sure the two time columns are proper datetimes
# (in case they were read in as plain strings)
df_sorted['ANKUNFTSZEIT'] = pd.to_datetime(df_sorted['ANKUNFTSZEIT'])
df_sorted['ABFAHRTSZEIT'] = pd.to_datetime(df_sorted['ABFAHRTSZEIT'])

# Empty list
edgelist = []

# Variables to store previous row and its index
prev_row = None
prev_idx = None

# Loop over rows of dataframe
for i, row in df_sorted.iterrows():
    # Only start with the second row and only if both rows belong to the same Fahrt
    # (i[0] is the FAHRT_BEZEICHNER level of the index)
    if prev_idx is not None and prev_idx == i[0]:
        # Add edge to edgelist assuming it's a directed edge
        edgelist.append((prev_row['STATION_NAME'], row['STATION_NAME'],
                         (row['ANKUNFTSZEIT'] - prev_row['ABFAHRTSZEIT']).total_seconds() / 60))
    # Set current row and row index to previous ones
    prev_idx = i[0]
    prev_row = row
To get a better idea, let’s have a look at the first list element:
# First list element
edgelist[0]
('Zürich HB', 'Basel SBB', 54.0)
We are not quite done yet. The problem is that the edgelist contains many duplicated entries since, for example, the stretch Zürich HB - Basel SBB is served by many different trains on a given day.
What we want to do is to go through all possible edges and sum up the number of times they occur. In addition, we would like to average the travel time between a given pair of stations over all trips between the two stations. The following code does exactly that and saves the result in the form of a dictionary.
# Empty dict
edges = {}

# Loop over elements in edgelist
for i in edgelist:
    # Create key
    key = (i[0], i[1])
    # Get previous entries in dict (if there are any)
    prev = edges.get(key, (0, 0))
    # Update values in dict
    edges[key] = (prev[0] + 1, prev[1] + i[2])

# Divide summed-up travel times by number of trips
edges = {k: (v[0], round(v[1] / v[0], 2)) for k, v in edges.items()}
Let’s look at the entry for the stretch between Zürich and Basel again:
# Look at some element in dict
edges[('Zürich HB', 'Basel SBB')]
(36, 54.0)
There are 36 trips between these two stations (in this direction) and they take 54 minutes on average.
We are now ready to create the final node list (and export it). First, we reduce ds to the train stations that actually appear in the edges (it still contains many bus and tram stops and other things).
# Set of stations that appear in the edgelist
stations_in_edgelist = set(sum(list(edges.keys()), ()))

# Reduce the nodes dataframe to only places in the edgelist
# (taking a copy so that we can modify it below)
nodes = ds[ds['STATION_NAME'].isin(stations_in_edgelist)].copy()
Second, we quickly check the number of missing values again.
# Missing values per column
nodes.isna().sum()
BPUIC 0
STATION_NAME 0
CANTON 21
MUNICIPALITY 21
COMPANY 0
LONGITUDE 0
LATITUDE 0
ELEVATION 1
AVG_DAILY_TRAFFIC 500
AVG_DAILY_TRAFFIC_WEEKDAYS 500
AVG_DAILY_TRAFFIC_WEEKENDS 500
dtype: int64
There are still some issues here. The one we can solve is the missing elevation: the station Tirano (in Italy) has no value for this column. We simply impute it manually (Tirano's elevation is approximately 441 m).
# Impute missing elevation for Tirano
nodes.loc[nodes['STATION_NAME'] == "Tirano", "ELEVATION"] = 441
The missing values for CANTON and MUNICIPALITY concern municipalities abroad (mostly in Germany and Italy). The 500 missing values in the traffic columns concern stations run by smaller companies or stations abroad. There is nothing we can do about these missing values.
# Have a look
nodes.head()

# Export node list
# nodes.sort_values("BPUIC").to_csv("nodelist.csv", sep=';', encoding='utf-8', index=False)
| | BPUIC | STATION_NAME | CANTON | MUNICIPALITY | COMPANY | LONGITUDE | LATITUDE | ELEVATION | AVG_DAILY_TRAFFIC | AVG_DAILY_TRAFFIC_WEEKDAYS | AVG_DAILY_TRAFFIC_WEEKENDS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 12683 | 8500100 | Tavannes | Bern | Tavannes | Swiss Federal Railways SBB | 7.201645 | 47.219845 | 754.17 | 1400.0 | 1600.0 | 810.0 |
| 12684 | 8500121 | Courfaivre | Jura | Haute-Sorne | Swiss Federal Railways SBB | 7.291166 | 47.335083 | 450.99 | 420.0 | 480.0 | 280.0 |
| 12685 | 8500103 | Sorvilier | Bern | Sorvilier | Swiss Federal Railways SBB | 7.305794 | 47.239354 | 681.07 | 60.0 | 70.0 | 49.0 |
| 12688 | 8500120 | Courtételle | Jura | Courtételle | Swiss Federal Railways SBB | 7.317943 | 47.342829 | 436.90 | 840.0 | 970.0 | 550.0 |
| 12689 | 8500102 | Malleray-Bévilard | Bern | Valbirse | Swiss Federal Railways SBB | 7.275946 | 47.238714 | 698.18 | 630.0 | 780.0 | 280.0 |
Before we export the edges, we replace the station names in the edge list with their BPUIC to make the edges more compact. Then we transform the dictionary into a dataframe, which can finally be exported.
# Create a node dict with BPUIC as values
node_dict = dict(zip(nodes.STATION_NAME, nodes.BPUIC))

# Transform edge dict to nested list and replace all station names with their BPUIC
edges = [[node_dict[k[0]], node_dict[k[1]], v[0], v[1]] for k, v in edges.items()]

# Create a dataframe
edges = pd.DataFrame(edges, columns=['BPUIC1', 'BPUIC2', 'NUM_CONNECTIONS', 'AVG_DURATION'])

# Have a look
edges.head()

# Export edge list
# edges.to_csv("edgelist.csv", sep=';', encoding='utf-8', index=False)
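As a teaser for the upcoming parts of this series, here is a minimal sketch (not part of the preprocessing above; it assumes both CSV files were exported as shown) of how the node and edge lists can be loaded into networkx as a directed graph:

import networkx as nx

# Load the exported node and edge lists.
nodes_df = pd.read_csv("nodelist.csv", sep=";")
edges_df = pd.read_csv("edgelist.csv", sep=";")

# Build a directed graph whose edges carry the trip counts and average durations.
G = nx.from_pandas_edgelist(edges_df, source="BPUIC1", target="BPUIC2",
                            edge_attr=["NUM_CONNECTIONS", "AVG_DURATION"],
                            create_using=nx.DiGraph)

# Attach all node attributes (station name, coordinates, traffic, etc.).
nx.set_node_attributes(G, nodes_df.set_index("BPUIC").to_dict("index"))

# Quick sanity check
print(G.number_of_nodes(), "nodes and", G.number_of_edges(), "edges")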