In Part 1 of this series on the Swiss train network, I demonstrated how to construct a directed network where nodes represent stations, and a directed edge exists whenever at least one nonstop train connection links two stations.
For some time, I believed this was the most intuitive graph representation for this context. However, after reading an insightful 2006 paper by Maciej Kurant and Patrick Thiran, I discovered that public transport networks can be represented in (at least) three distinct ways. The graph representation I introduced in Part 1 aligns with what they call the space-of-stops representation.
Yet, depending on the specific questions being asked, two other graph representations can also be useful. In the space-of-changes representation proposed by Kurant and Thiran (2006), an edge exists between any two stations connected by a train on a given “Fahrt”, even if the train makes stops at other stations in between.
The third representation, space-of-stations, includes an undirected edge between two stations only if they are directly connected by railway tracks, with no other station in between. This approach offers a more infrastructure-focused perspective on the network.
Crucially, all three representations share the same set of nodes—namely, all active train stations. What differs is how the edges are defined.
Kurant and Thiran (2006) also highlight how the shortest path length is interpreted differently in each representation:
- space-of-stops: The number of train stops on a journey between two stations.
- space-of-changes: The number of times a traveler must change trains between two stations.
- space-of-stations: The number of stations passed through between two stations.
Lastly, they point out an important subgraph relationship among these representations: space-of-stations is a subgraph of space-of-stops, which in turn is a subgraph of space-of-changes.
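To make the three definitions and the subgraph chain concrete, here is a minimal sketch on a toy network (hypothetical stations A–D, not part of the real data): one local train stopping at every station and one express running nonstop from A to D.

```python
from itertools import combinations

# Toy example (hypothetical stations A-D): one local train stopping at every
# station and one express running nonstop from A to D.
toy_fahrten = [["A", "B", "C", "D"],  # local: stops everywhere
               ["A", "D"]]            # express: nonstop

# space-of-stops: an edge between consecutive stops of any train.
space_of_stops = {tuple(sorted(p)) for f in toy_fahrten for p in zip(f, f[1:])}

# space-of-changes: an edge between any two stations served by the same train.
space_of_changes = {tuple(sorted(p)) for f in toy_fahrten for p in combinations(f, 2)}

# space-of-stations: track adjacency only; the express edge (A, D) is a
# shortcut over B and C, so just the local train's edges remain.
space_of_stations = {("A", "B"), ("B", "C"), ("C", "D")}

# The subgraph chain holds: stations is a subgraph of stops, stops of changes.
print(space_of_stations <= space_of_stops <= space_of_changes)  # True
```

Note how the express adds the edge (A, D) to space-of-stops, and the local train adds "travel-without-changing" edges such as (B, D) to space-of-changes.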
As always, we begin the practical part with loading the libraries we are going to use.
```python
import pandas as pd
from collections import Counter, defaultdict

# Check versions of libraries.
print("Pandas version:", pd.__version__)

# Make sure there is no limit on the number of columns shown.
pd.set_option('display.max_columns', None)
```
Pandas version: 2.1.4
We first present how the space-of-changes representation can be extracted. After that we show one way of finding the edges for the space-of-stations representation.
Space-of-changes
We start by importing the already processed “Ist-Daten” from Part 1. Since we load them from a CSV file we have to transform all date-time information into the Pandas datetime format.
```python
# Load the processed IST-DATEN.
df = pd.read_csv('ist-daten.csv', low_memory=False)

# Convert BETRIEBSTAG to date format.
df['BETRIEBSTAG'] = pd.to_datetime(df['BETRIEBSTAG'])

# Convert ANKUNFTSZEIT, AN_PROGNOSE, ABFAHRTSZEIT, AB_PROGNOSE to datetime format.
df['ANKUNFTSZEIT'] = pd.to_datetime(df['ANKUNFTSZEIT'])
df['AN_PROGNOSE'] = pd.to_datetime(df['AN_PROGNOSE'])
df['ABFAHRTSZEIT'] = pd.to_datetime(df['ABFAHRTSZEIT'])
df['AB_PROGNOSE'] = pd.to_datetime(df['AB_PROGNOSE'])
```
Next comes the key part of extracting the edges for the space-of-changes representation. We will group the rows by FAHRT_BEZEICHNER. Then, we will use two nested loops to create edges between any station and all subsequent stations on a given “Fahrt”. Note that in contrast to Kurant and Thiran (2006) we will extract directed edges. The following function specifies how the edges can be extracted for one group. It’s not very performant code and there may be smarter and more efficient ways of doing this. But it does the job.
```python
# Function to compute (directed) edges according to the space-of-changes principle.
def get_edges_in_groups(group):
    # Empty list for results of a group.
    results = []
    # Loop over all rows in group.
    for i in range(len(group)):
        # Nested loop over all subsequent rows.
        for j in range(i + 1, len(group)):
            # Append edge to results list.
            results.append((
                group.iloc[i]["STATION_NAME"],  # Station of origin
                group.iloc[j]["STATION_NAME"],  # Station of destination
                (group.iloc[j]['ANKUNFTSZEIT'] - group.iloc[i]['ABFAHRTSZEIT']).total_seconds() / 60  # Time (minutes)
            ))
    # Return list.
    return results
```
We can now apply that function to every group. On my machine, this step took roughly 10 minutes.
```python
# Now apply that function group-wise.
edges_series = df.groupby("FAHRT_BEZEICHNER", group_keys=False).apply(get_edges_in_groups)
```
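As an aside: much of that runtime comes from the repeated `.iloc` lookups in the inner loop. A hedged sketch of an equivalent, usually faster variant (`get_edges_in_groups_fast` is a hypothetical name, not used in what follows) pulls the columns out as plain lists first:

```python
import pandas as pd

# Hedged sketch (not the code used above): same nested loops, but the columns
# are extracted as plain lists first, avoiding repeated slow .iloc indexing.
def get_edges_in_groups_fast(group):
    names = group["STATION_NAME"].tolist()
    arrivals = group["ANKUNFTSZEIT"].tolist()
    departures = group["ABFAHRTSZEIT"].tolist()
    results = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            results.append((
                names[i],                                            # station of origin
                names[j],                                            # station of destination
                (arrivals[j] - departures[i]).total_seconds() / 60,  # time (minutes)
            ))
    return results

# Quick check on a made-up three-stop 'Fahrt' (hypothetical stations and times).
toy_group = pd.DataFrame({
    "STATION_NAME": ["A", "B", "C"],
    "ABFAHRTSZEIT": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 08:10", "2024-01-01 08:20"]),
    "ANKUNFTSZEIT": pd.to_datetime(["2024-01-01 07:58", "2024-01-01 08:08", "2024-01-01 08:18"]),
})
print(get_edges_in_groups_fast(toy_group))  # [('A', 'B', 8.0), ('A', 'C', 18.0), ('B', 'C', 8.0)]
```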
The output of the previous step is a Pandas series, as the following check confirms. Each element of the series is indexed by a FAHRT_BEZEICHNER and contains a list of edges, each edge also including the travel time between its two nodes.
```python
# Make sure the result is a pandas series.
print("Is pandas series:", isinstance(edges_series, pd.Series))

# Check first few elements:
edges_series.head()
```
We perform another quick check to make sure the series contains as many elements as there are unique FAHRT_BEZEICHNER strings. That seems to be the case.
```python
# How many elements?
print("Number of elements in series:", len(edges_series))

# Is that the number of distinct FAHRTEN?
df[['FAHRT_BEZEICHNER']].nunique()
```
Number of elements in series: 14653
FAHRT_BEZEICHNER 14653
dtype: int64
We quickly check the list of edges for one “Fahrt” to make sure it really extracted the edges in the right way.
```python
# Let's check out one FAHRT.
edges_series["85:97:9:000"]
```
This seems to be a train that goes from Yverdon-les-Bains to Ste-Croix. It stops in Vuiteboeuf, Baulmes, and Six-Fontaines before getting to Ste-Croix. There is an edge between every station and all its subsequent stations on that “Fahrt”. This is exactly what we wanted.
Now, we flatten the Pandas series of lists into one edgelist.
```python
# Flatten the result into one edgelist.
edgelist = [x for l in edges_series.values for x in l]
print("Number of edges:", len(edgelist))
```
Number of edges: 1110834
This edgelist contains over one million edges. Note, however, that many of them are duplicates as we looped over all “Fahrten” of a given day. As in Part 1, we will now aggregate all duplicate edges, counting the number of connections and the average travel time between any two nodes.
```python
# Empty dict.
edges = {}

# Loop over elements in edgelist.
for i in edgelist:
    # Create key.
    key = (i[0], i[1])
    # Get previous entries in dict (if there are any).
    prev = edges.get(key, (0, 0))
    # Update values in dict.
    edges[key] = (prev[0] + 1, prev[1] + i[2])

# Divide summed-up travel times by number of trips.
edges = {k: (v[0], round(v[1]/v[0], 2)) for k, v in edges.items()}

print("Number of edges:", len(edges))
```
Number of edges: 37951
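As an aside, the defaultdict imported at the top can express the same count-and-average aggregation a bit more compactly. A sketch on a toy edgelist (hypothetical values, variable names `toy_edgelist` and `edges_toy` are mine):

```python
from collections import defaultdict

# Toy edgelist of (origin, destination, minutes) triples -- hypothetical values.
toy_edgelist = [("A", "B", 10.0), ("A", "B", 12.0), ("B", "C", 5.0)]

# Accumulate (count, total_minutes) per directed edge.
acc = defaultdict(lambda: [0, 0.0])
for origin, dest, minutes in toy_edgelist:
    acc[(origin, dest)][0] += 1
    acc[(origin, dest)][1] += minutes

# Turn the summed travel times into averages, as in the dict-based loop above.
edges_toy = {k: (v[0], round(v[1] / v[0], 2)) for k, v in acc.items()}
print(edges_toy)  # {('A', 'B'): (2, 11.0), ('B', 'C'): (1, 5.0)}
```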
We are left with 37’951 directed and weighted edges that are currently stored in a dict called edges. Let’s see how long it takes to get from Olten to Winterthur and how many connections there are on a given day:
```python
# Test
edges[("Olten", "Winterthur")]
```
(25, 63.84)
I can travel from Olten to Winterthur 25 times per day (without having to change trains) and the trip takes a bit more than an hour.
Now, we import the nodelist from Part 1 so that we can replace the station names in the edges by the BPUIC identifiers.
```python
# Load the nodelist.
nodes = pd.read_csv("nodelist.csv", sep=";")

# Create a node dict with BPUIC as values.
node_dict = dict(zip(nodes.STATION_NAME, nodes.BPUIC))
```
After changing all station names to BPUIC numbers, we create a dataframe that can then be exported as a CSV file. Yay, we're done!
```python
# Transform edge dict to nested list and replace all station names with their BPUIC.
edges = [[node_dict[k[0]], node_dict[k[1]], v[0], v[1]] for k, v in edges.items()]

# Create a dataframe.
edges = pd.DataFrame(edges, columns=['BPUIC1', 'BPUIC2', 'NUM_CONNECTIONS', 'AVG_DURATION'])

# Have a look.
edges.head()

# Export edge list.
# edges.to_csv("edgelist_SoC.csv", sep=';', encoding='utf-8', index=False)
```
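To illustrate the interpretation of shortest paths in this representation (the number of train changes), here is a self-contained sketch on a toy directed graph with hypothetical station IDs, not the exported data:

```python
from collections import deque

# Toy directed space-of-changes edges (hypothetical station IDs).
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]

# Build an adjacency dict.
adj = {}
for a, b in toy_edges:
    adj.setdefault(a, []).append(b)

def min_hops(adj, src, dst):
    """BFS: smallest number of edges needed to get from src to dst (None if unreachable)."""
    seen = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return seen[u]
        for v in adj.get(u, []):
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return None

# In the space-of-changes graph, hops - 1 = number of train changes.
print(min_hops(adj, 1, 4))  # 2: e.g. 1 -> 3 -> 4, i.e. one change
```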
Space-of-stations

For the space-of-stations graph representation we make use of the fact that the space-of-stations graph should be a subgraph of the space-of-stops graph that we extracted in Part 1, with the latter containing additional edges that represent shortcuts. For example, the space-of-stops graph contains a directed edge from Olten to Basel SBB, as there are nonstop trains between these two stations. However, there are also smaller, regional trains which stop at all stations in between. The key idea (also nicely shown by Kurant and Thiran) is to go through all edges in the space-of-stops graph and identify the ones that are shortcuts.
We first load the (space-of-stops) edgelist from Part 1 and add the station names.
```python
# Load the space-of-stops edgelist.
edges = pd.read_csv("edgelist.csv", sep=";")

# Create a node dict with station names as values.
node_dict = dict(zip(nodes.BPUIC, nodes.STATION_NAME))

# Add actual station names.
edges["STATION1"] = [node_dict[v] for v in edges["BPUIC1"]]
edges["STATION2"] = [node_dict[v] for v in edges["BPUIC2"]]

# Check out the dataframe.
print(edges.head())
print("Number of edges in space-of-stops representation:", edges.shape[0])
```
BPUIC1 BPUIC2 NUM_CONNECTIONS AVG_DURATION STATION1 STATION2
0 8503000 8500010 36 54.00 Zürich HB Basel SBB
1 8500010 8500090 67 6.07 Basel SBB Basel Bad Bf
2 8500090 8500010 67 6.39 Basel Bad Bf Basel SBB
3 8500010 8503000 39 57.87 Basel SBB Zürich HB
4 8503424 8500090 17 73.18 Schaffhausen Basel Bad Bf
Number of edges in space-of-stops representation: 4215
For the space-of-stations representation, undirected edges make the most sense. Thus, we need to make the directed edges from the space-of-stops representation undirected and remove all duplicates that this introduces (e.g., ‘Olten - Basel SBB’ and ‘Basel SBB - Olten’). With a little help from ChatGPT I found an elegant solution to achieve just that.
More concretely, we iterate over the zip object containing the node pairs of all edges. The min() and max() functions applied to the station names will sort the station names alphabetically so that, for example, ‘Olten - Basel SBB’ and ‘Basel SBB - Olten’ are both transformed to ‘Basel SBB - Olten’. Finally, the set() function will get rid of all duplicates.
```python
# Get a list of unique undirected edges.
unique_undirected_edges = list(set((min(e1, e2), max(e1, e2)) for e1, e2 in zip(edges["STATION1"], edges["STATION2"])))
print("Number of unique undirected edges:", len(unique_undirected_edges))
```
Number of unique undirected edges: 2154
This step leaves us with 2’154 undirected, unique edges.
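To see the min()/max() trick in isolation, here it is applied to just the example pair from the text; both directions collapse onto the alphabetically sorted tuple:

```python
# Both directed versions of the same connection...
pairs = [("Olten", "Basel SBB"), ("Basel SBB", "Olten")]

# ...collapse onto one alphabetically sorted undirected edge.
undirected = list(set((min(a, b), max(a, b)) for a, b in pairs))
print(undirected)  # [('Basel SBB', 'Olten')]
```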
Data preprocessing for improved efficiency
In order to make the procedure further below more efficient, we extract all unique “Fahrten” here. More specifically, we create a dictionary fahrten with the sequence of station names as key and the FAHRT_BEZEICHNER as value. If a sequence of station names already exists as a key in the dict, the value belonging to that key is overwritten with the new FAHRT_BEZEICHNER. That doesn’t bother us, since we just want to be able to extract one example “Fahrt” per unique sequence of stops.
```python
# Empty dict.
fahrten = {}

# Loop over grouped df.
# If the same key (sequence of stops) reappears, the value will be overwritten.
# But that behavior is desired: we only want to keep one FAHRT_BEZEICHNER per key.
for fahrt, group in df.groupby('FAHRT_BEZEICHNER'):
    fahrten[tuple(group['STATION_NAME'])] = fahrt

print("Number of unique 'Fahrten':", len(fahrten))
print("Number of 'Fahrten' in whole dataframe:", df['FAHRT_BEZEICHNER'].nunique())
```
Number of unique 'Fahrten': 1678
Number of 'Fahrten' in whole dataframe: 14653
We can see from the above output that this step drastically reduces the “Fahrten” that we will iterate over later.
In the following code chunk we filter the “Ist-Daten” (df) loaded earlier so that only the unique “Fahrten” are left.
```python
# Reduce the dataframe to the 'Fahrten' in the list of values of the dict.
df = df[df['FAHRT_BEZEICHNER'].isin(list(fahrten.values()))]
print("Remaining number of rows:", df.shape[0])
```
Remaining number of rows: 17597
Another little trick to make things more efficient later is to create a dictionary with station names as keys and, as values, a list of all FAHRT_BEZEICHNER strings a station name is part of (kind of an inverted index).
```python
# defaultdict with lists.
result_dict = defaultdict(list)

# Iterate over rows.
for _, row in df.iterrows():
    # Fill the dict with stations as keys and FAHRT_BEZEICHNER strings as values.
    result_dict[row['STATION_NAME']].append(row['FAHRT_BEZEICHNER'])

# Convert back to a normal dict.
result_dict = dict(result_dict)
```
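As an aside (not used below), the same inverted index can be built with a single groupby call instead of iterrows(), which is typically much faster; the result is equivalent up to the order of the lists, which doesn't matter here since they are only intersected as sets later. A sketch on a toy stand-in for df (hypothetical values):

```python
import pandas as pd

# Toy stand-in for df with just the two relevant columns (hypothetical values).
toy_df = pd.DataFrame({
    "STATION_NAME": ["A", "B", "A", "C"],
    "FAHRT_BEZEICHNER": ["f1", "f1", "f2", "f2"],
})

# One groupby call instead of a Python-level row loop.
toy_index = toy_df.groupby("STATION_NAME")["FAHRT_BEZEICHNER"].apply(list).to_dict()
print(toy_index)  # {'A': ['f1', 'f2'], 'B': ['f1'], 'C': ['f2']}
```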
Identify shortcuts
Next, we perform the key step in extracting the edges for the space-of-stations representation: we need to identify all edges that are shortcuts, passing train stations without stopping.
We first define a custom function that determines whether any two station names a and b are adjacent in a sequence (list) of station names lst.
```python
# Function to check whether elements a and b are NOT adjacent in lst.
def is_shortcut(lst, a, b):
    return not any((x, y) == (a, b) or (x, y) == (b, a) for x, y in zip(lst, lst[1:]))
```
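A quick sanity check on an abbreviated version of the Yverdon-les-Bains stop sequence from earlier shows how the function behaves:

```python
# The helper from above, repeated so this snippet runs standalone.
def is_shortcut(lst, a, b):
    return not any((x, y) == (a, b) or (x, y) == (b, a) for x, y in zip(lst, lst[1:]))

# Abbreviated sequence of stops from the Yverdon-les-Bains example.
seq = ["Yverdon-les-Bains", "Vuiteboeuf", "Baulmes", "Ste-Croix"]

print(is_shortcut(seq, "Yverdon-les-Bains", "Vuiteboeuf"))  # False: consecutive stops
print(is_shortcut(seq, "Baulmes", "Vuiteboeuf"))            # False: adjacency counts in both directions
print(is_shortcut(seq, "Yverdon-les-Bains", "Baulmes"))     # True: one stop in between
```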
Then, we iterate over all undirected, unique edges that we prepared above. For each edge we go through the following steps:

1. We get the FAHRT_BEZEICHNER strings for all “Fahrten” which both nodes of the edge are part of. For this we use the inverted index-style dictionary we created above.
2. Then we perform an inner loop over the “Fahrten” extracted in the first step:
   - We first extract the sequence of stations of a “Fahrt”.
   - We use our custom function from above to check whether the two nodes are adjacent in the sequence of stations.
   - If they are not adjacent, i.e., the edge represents a shortcut, we save that edge, break the inner loop, and move on to the next edge.
```python
# Empty list for shortcuts.
shortcut_edges = []

# Loop over list of undirected edges.
for idx, edge in enumerate(unique_undirected_edges):
    # Find all 'Fahrten' in which both stations of the edge appear.
    intersection = list(set(result_dict[edge[0]]) & set(result_dict[edge[1]]))
    # Initialize shortcut to False.
    shortcut = False
    # Loop over 'Fahrten' in which both stations of the edge appear.
    for fahrt in intersection:
        # Get the sequence of stations in current 'Fahrt'.
        seq_of_stations = df.loc[df['FAHRT_BEZEICHNER'] == fahrt, 'STATION_NAME'].tolist()
        # Check whether the edge represents a shortcut in that sequence.
        shortcut = is_shortcut(seq_of_stations, edge[0], edge[1])
        # If it is a shortcut, add it to the list and break the inner loop.
        if shortcut:
            shortcut_edges.append((fahrt, edge))
            break

print("Number of shortcut edges:", len(shortcut_edges))
```
Number of shortcut edges: 442
A total of 442 edges are identified as shortcuts. Let’s have a look at the first one:
```python
# Check first shortcut.
print(shortcut_edges[0])

# Check the 'Fahrt' in which it was detected as a shortcut.
df.loc[df['FAHRT_BEZEICHNER'] == shortcut_edges[0][0], 'STATION_NAME']
```
('85:22:2076:000', ('Bühler', 'Gais'))
1236 Gais
1237 Zweibrücken
1238 Strahlholz
1239 Bühler
1240 Steigbach
1241 Teufen AR
1242 Teufen AR Stofel
1243 Sternen bei Teufen
1244 Niederteufen
1245 Lustmühle
1246 St. Gallen Riethüsli
1247 St. Gallen Güterbahnhof
1248 St. Gallen
1249 St. Gallen Marktplatz
1250 St. Gallen Spisertor
1251 St. Gallen Schülerhaus
1252 St. Gallen Birnbäumen
1253 St. Gallen Notkersegg
1254 Schwarzer Bären
1255 Vögelinsegg
1256 Schützengarten
1257 Speicher
1258 Bendlehn
1259 Gfeld
1260 Trogen
Name: STATION_NAME, dtype: object
From the whole sequence of stations, we can see that the edge identified as a shortcut does indeed connect two stations that are not consecutive stops.
Finally, we remove the FAHRT_BEZEICHNER from shortcut_edges and create the final edge list without shortcuts.
```python
# Extract only the edges.
shortcut_edges_clean = [i[1] for i in shortcut_edges]

# Get the final list of non-shortcut edges.
final_edges = [e for e in unique_undirected_edges if e not in shortcut_edges_clean]
print("Number of edges:", len(final_edges))
```
Number of edges: 1712
We have a final number of edges of \(2154-442=1712\).
Validate with “Liniendaten”
The extraction of the edges in the space-of-stations representation was a bit more complex than for space-of-changes or space-of-stops. That’s why I would like to run some checks.
We can validate some of the edges we extracted with another dataset from the Open Data Portal of SBB. The dataset Linie (Betriebspunkte) contains all railway “lines” maintained by SBB with all “Betriebspunkte” (including stations) that are located along these lines. Let’s load this dataset:
```python
# Load the data about "Linien mit Betriebspunkten".
linien = pd.read_csv('linie-mit-betriebspunkten.csv', sep=";")

# Reduce to relevant columns.
linien = linien[["Name Haltestelle", "Linie", "KM", "Linien Text", "BPUIC"]]
print("Shape of dataframe:", linien.shape)
```
Shape of dataframe: (1884, 5)
Let’s have a look:
```python
# Have a look at the dataframe.
linien.head()
```

  Name Haltestelle  Linie         KM                           Linien Text    BPUIC
0            Aarau    649   41.50577           Aarau - Woschnau Tunnel alt  8502113
1          Aarberg    251   95.49304             Palezieux Est - Lyss Nord  8504404
2         Aesch BL    230  113.00006          Delemont Est - Basel SBB Ost  8500117
3           Aespli    455   92.76586                 Unterhard BE - Aespli  8515299
4            Aigle    100   39.31241  Lausanne - Simplon Tunnel I - Iselle  8501400
The rows in that dataset are not just stations but also other “Betriebspunkte” (important locations that are needed to run the infrastructure). But we can identify the stations among the “Betriebspunkte” by joining the nodes dataframe on BPUIC and only keeping the entries for which there was a matching row in nodes.
```python
# Join the rows of the nodelist based on BPUIC.
linien = pd.merge(linien, nodes[["BPUIC", "STATION_NAME"]], on='BPUIC', how='left')

# How many entries have a missing value, i.e., are not stations?
print("Number of non-stations:", linien["STATION_NAME"].isna().sum())

# Drop all rows that are not stations.
linien = linien.dropna(subset=["STATION_NAME"])
print("Number of remaining rows:", linien.shape[0])
```
Number of non-stations: 979
Number of remaining rows: 905
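As an aside, pandas can make this matched/unmatched split explicit via merge(..., indicator=True), which adds a '_merge' column. A sketch on toy stand-ins for linien and nodes (hypothetical values):

```python
import pandas as pd

# Toy stand-ins for 'linien' and 'nodes' (hypothetical values).
toy_linien = pd.DataFrame({'BPUIC': [1, 2, 3], 'KM': [0.5, 1.2, 3.4]})
toy_nodes = pd.DataFrame({'BPUIC': [1, 3], 'STATION_NAME': ['A', 'B']})

# indicator=True adds a '_merge' column flagging the origin of each row.
merged = pd.merge(toy_linien, toy_nodes, on='BPUIC', how='left', indicator=True)

# '_merge' is 'both' for stations and 'left_only' for other Betriebspunkte.
print(merged['_merge'].tolist())  # ['both', 'left_only', 'both']
```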
Next, we group the rows by 'Linie' and sort them in ascending order by 'KM' (where along the line is the “Betriebspunkt” located, in terms of kilometres) so that the stations for each line are sorted in the right order.
```python
# Function to sort entries within a group in ascending order of KM.
def sort_data(group):
    return group.sort_values('KM', ascending=True)

# Sort for each group.
linien_sorted = linien.groupby('Linie', group_keys=False).apply(sort_data)

# Let's have a look at Linie 290.
linien_sorted.loc[linien_sorted['Linie'] == 290]
```
     Name Haltestelle  Linie         KM            Linien Text    BPUIC   STATION_NAME
1351    Ostermundigen    290  110.76500  Bern Wylerfeld - Thun  8507002  Ostermundigen
865          Gumligen    290  113.95844  Bern Wylerfeld - Thun  8507003       Gümligen
270           Rubigen    290  119.03812  Bern Wylerfeld - Thun  8507005        Rubigen
1333        Munsingen    290  122.13149  Bern Wylerfeld - Thun  8507006      Münsingen
695         Wichtrach    290  125.73250  Bern Wylerfeld - Thun  8507007      Wichtrach
892            Kiesen    290  128.30300  Bern Wylerfeld - Thun  8507008         Kiesen
1055          Uttigen    290  131.09513  Bern Wylerfeld - Thun  8507009        Uttigen
We see here for one example (Line 290) that the stations are now nicely sorted in ascending order of 'KM'.
Now, we can create a new column that always contains the station name of the next row using the handy shift() method. The last row within a group will always have a missing value for that new column as there is no next station at the end of a line. So, we drop the last row of each line.
```python
# Create a new column that for each row contains the next stop within the group.
linien_sorted["NEXT_STATION"] = linien_sorted.groupby("Linie")["STATION_NAME"].shift(-1)

# Drop all rows where 'NEXT_STATION' is missing.
linien_sorted = linien_sorted.dropna(subset=["NEXT_STATION"])
```
We extract the values of the last two columns as the edges. Importantly, we now sort the node pairs in each edge in the same way as before (alphabetically).
```python
# Now let's extract the edges.
linien_edges = list(zip(linien_sorted['STATION_NAME'], linien_sorted['NEXT_STATION']))

# Make sure the tuples are arranged in the same way as above (and unique).
linien_edges = list(set((min(e[0], e[1]), max(e[0], e[1])) for e in linien_edges))
```
We first want to check whether there are edges in linien_edges that are neither a shortcut nor in the final edgelist from above.
```python
# Check which edges are in linien_edges but neither in final_edges nor in shortcut_edges_clean.
[x for x in linien_edges if x not in final_edges and x not in shortcut_edges_clean]
```
There are some candidate edges but I checked all of them manually in the train schedule and none of them seem to have direct train connections. It could be that some of these are old train lines that are not active anymore.
Are there any edges in linien_edges that were classified as shortcuts?
```python
# Are there any edges that I classified as shortcuts?
[x for x in linien_edges if x in shortcut_edges_clean]
```
Yes, but these two edges are in fact shortcuts. Between Niederbipp and Oensingen there is a small station called Niederbipp Industrie. Between Chur and Landquart there are several smaller stations. Note, however, that even though the train tracks between Chur and Landquart pass those smaller stations, there may be no infrastructure for trains to actually stop.
A manual check of all 1’712 edges reveals that there are more shortcuts that our procedure was not able to identify. For example, the edge (Bern, Zofingen) cannot be identified because there is no other “Fahrt” that contains these two stations and stops somewhere in between. We manually remove such edges. In addition, we add some edges for which I know that there is actually infrastructure (tunnels, high-speed routes) that directly connects the two nodes involved.
```python
# Manually remove edges.
final_edges.remove(('Bern', 'Zofingen'))
final_edges.remove(('Bern Wankdorf', 'Zürich HB'))
final_edges.remove(('Basel Bad Bf', 'Schaffhausen'))

# Manually add edges.
final_edges.append(('Biasca', 'Erstfeld'))
# final_edges.append(('Frutigen', 'Visp'))  # Seems to be already in there.
final_edges.append(('Bern Wankdorf', 'Rothrist'))
```
Export the edgelist
Finally, we can export the edges as before for the other representations.
```python
# Create a node dict with BPUIC as values.
node_dict = dict(zip(nodes.STATION_NAME, nodes.BPUIC))

# Transform the edge list and replace all station names with their BPUIC.
edges = [[node_dict[e[0]], node_dict[e[1]]] for e in final_edges]

# Create a dataframe.
edges = pd.DataFrame(edges, columns=['BPUIC1', 'BPUIC2'])

# Have a look.
edges.head()

# Export edge list.
# edges.to_csv("edgelist_SoS.csv", sep=';', encoding='utf-8', index=False)
```
Kurant, M., & Thiran, P. (2006). Extraction and analysis of traffic and topologies of transportation networks. Physical Review E, 74(3), 036114. https://doi.org/10.1103/PhysRevE.74.036114
The title image has been created by Wikimedia user JoachimKohler-HB and is licensed under Creative Commons.