In Part 1 of this series on the Swiss train network, I demonstrated how to construct a directed network where nodes represent stations, and a directed edge exists whenever at least one nonstop train connection links two stations.
For some time, I believed this was the most intuitive graph representation for this context. However, after reading an insightful 2006 paper by Maciej Kurant and Patrick Thiran, I discovered that public transport networks can be represented in (at least) three distinct ways. The graph representation I introduced in Part 1 aligns with what they call the space-of-stops representation.
Yet, depending on the specific questions being asked, two other graph representations can also be useful. In the space-of-changes representation proposed by Kurant and Thiran (2006), an edge exists between any two stations connected by a train on a given “Fahrt”, even if the train makes stops at other stations in between.
The third representation, space-of-stations, includes an undirected edge between two stations only if they are directly connected by railway tracks, with no other station in between. This approach offers a more infrastructure-focused perspective on the network.
Crucially, all three representations share the same set of nodes—namely, all active train stations. What differs is how the edges are defined.
Kurant and Thiran (2006) also highlight how the shortest path length is interpreted differently in each representation:
- space-of-stops: The number of train stops on a journey between two stations.
- space-of-changes: The number of times a traveler must change trains between two stations.
- space-of-stations: The number of stations passed through between two stations.
Lastly, they point out an important subgraph relationship among these representations: space-of-stations is a subgraph of space-of-stops, which in turn is a subgraph of space-of-changes.
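To make the three definitions and the subgraph chain concrete, here is a minimal sketch on a toy network (hypothetical stations A–D, not part of the real data): one local train stopping at every station and one express running nonstop from A to D.

```python
from itertools import combinations

# Toy example (hypothetical stations A-D): one local train stopping at every
# station and one express running nonstop from A to D.
toy_fahrten = [["A", "B", "C", "D"],  # local: stops everywhere
               ["A", "D"]]            # express: nonstop

# space-of-stops: an edge between consecutive stops of any train.
space_of_stops = {tuple(sorted(p)) for f in toy_fahrten for p in zip(f, f[1:])}

# space-of-changes: an edge between any two stations served by the same train.
space_of_changes = {tuple(sorted(p)) for f in toy_fahrten for p in combinations(f, 2)}

# space-of-stations: track adjacency only; the express edge (A, D) is a
# shortcut over B and C, so just the local train's edges remain.
space_of_stations = {("A", "B"), ("B", "C"), ("C", "D")}

# The subgraph chain holds: stations is a subgraph of stops, stops of changes.
print(space_of_stations <= space_of_stops <= space_of_changes)  # True
```

Note how the express adds the edge (A, D) to space-of-stops, and the local train adds "travel-without-changing" edges such as (B, D) to space-of-changes.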
As always, we begin the practical part with loading the libraries we are going to use.
```python
import pandas as pd
from collections import Counter, defaultdict

# Check versions of libraries.
print("Pandas version:", pd.__version__)

# Make sure there is no limit on the number of columns shown.
pd.set_option('display.max_columns', None)
```
Pandas version: 2.1.4
We first present how the space-of-changes representation can be extracted. After that we show one way of finding the edges for the space-of-stations representation.
Space-of-changes
We start by importing the already processed “Ist-Daten” from Part 1. Since we load them from a CSV file we have to transform all date-time information into the Pandas datetime format.
```python
# Load the processed IST-DATEN.
df = pd.read_csv('ist-daten.csv', low_memory=False)

# Convert BETRIEBSTAG to date format.
df['BETRIEBSTAG'] = pd.to_datetime(df['BETRIEBSTAG'])

# Convert ANKUNFTSZEIT, AN_PROGNOSE, ABFAHRTSZEIT, AB_PROGNOSE to datetime format.
df['ANKUNFTSZEIT'] = pd.to_datetime(df['ANKUNFTSZEIT'])
df['AN_PROGNOSE'] = pd.to_datetime(df['AN_PROGNOSE'])
df['ABFAHRTSZEIT'] = pd.to_datetime(df['ABFAHRTSZEIT'])
df['AB_PROGNOSE'] = pd.to_datetime(df['AB_PROGNOSE'])
```
Next comes the key part of extracting the edges for the space-of-changes representation. We will group the rows by FAHRT_BEZEICHNER. Then, we will use two nested loops to create edges between any station and all subsequent stations on a given “Fahrt”. Note that in contrast to Kurant and Thiran (2006) we will extract directed edges. The following function specifies how the edges can be extracted for one group. It’s not very performant code and there may be smarter and more efficient ways of doing this. But it does the job.
```python
# Function to compute (directed) edges according to the space-of-changes principle.
def get_edges_in_groups(group):
    # Empty list for results of a group.
    results = []
    # Loop over all rows in group.
    for i in range(len(group)):
        # Nested loop over all subsequent rows.
        for j in range(i + 1, len(group)):
            # Append edge to results list.
            results.append((
                group.iloc[i]["STATION_NAME"],  # Station of origin
                group.iloc[j]["STATION_NAME"],  # Station of destination
                (group.iloc[j]['ANKUNFTSZEIT'] - group.iloc[i]['ABFAHRTSZEIT']).total_seconds() / 60  # Time (minutes)
            ))
    # Return list.
    return results
```
We can now apply that function to every group. On my machine, this step took roughly 10 minutes.
```python
# Now apply that function group-wise.
edges_series = df.groupby("FAHRT_BEZEICHNER", group_keys=False).apply(get_edges_in_groups)
```
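As an aside: much of that runtime comes from the repeated `.iloc` lookups in the inner loop. A hedged sketch of an equivalent, usually faster variant (`get_edges_in_groups_fast` is a hypothetical name, not used in what follows) pulls the columns out as plain lists first:

```python
import pandas as pd

# Hedged sketch (not the code used above): same nested loops, but the columns
# are extracted as plain lists first, avoiding repeated slow .iloc indexing.
def get_edges_in_groups_fast(group):
    names = group["STATION_NAME"].tolist()
    arrivals = group["ANKUNFTSZEIT"].tolist()
    departures = group["ABFAHRTSZEIT"].tolist()
    results = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            results.append((
                names[i],                                            # station of origin
                names[j],                                            # station of destination
                (arrivals[j] - departures[i]).total_seconds() / 60,  # time (minutes)
            ))
    return results

# Quick check on a made-up three-stop 'Fahrt' (hypothetical stations and times).
toy_group = pd.DataFrame({
    "STATION_NAME": ["A", "B", "C"],
    "ABFAHRTSZEIT": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 08:10", "2024-01-01 08:20"]),
    "ANKUNFTSZEIT": pd.to_datetime(["2024-01-01 07:58", "2024-01-01 08:08", "2024-01-01 08:18"]),
})
print(get_edges_in_groups_fast(toy_group))  # [('A', 'B', 8.0), ('A', 'C', 18.0), ('B', 'C', 8.0)]
```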
The output of the previous step is a Pandas series, as the following check confirms. Each element of the series is indexed by a FAHRT_BEZEICHNER and contains a list of edges, each edge also including the travel time between its two nodes.
```python
# Make sure the result is a pandas series.
print("Is pandas series:", isinstance(edges_series, pd.Series))

# Check first few elements:
edges_series.head()
```
We perform another quick check to make sure the series contains as many elements as there are unique FAHRT_BEZEICHNER strings. That seems to be the case.
```python
# How many elements?
print("Number of elements in series:", len(edges_series))

# Is that the number of distinct FAHRTEN?
df[['FAHRT_BEZEICHNER']].nunique()
```
Number of elements in series: 14653
FAHRT_BEZEICHNER 14653
dtype: int64
We quickly check the list of edges for one “Fahrt” to make sure it really extracted the edges in the right way.
```python
# Let's check out one FAHRT.
edges_series["85:97:9:000"]
```
This seems to be a train that goes from Yverdon-les-Bains to Ste-Croix. It stops in Vuiteboeuf, Baulmes, and Six-Fontaines before getting to Ste-Croix. There is an edge between every station and all its subsequent stations on that “Fahrt”. This is exactly what we wanted.
Now, we flatten the Pandas series of lists into one edgelist.
```python
# Flatten the result into one edgelist.
edgelist = [x for l in edges_series.values for x in l]
print("Number of edges:", len(edgelist))
```
Number of edges: 1110834
This edgelist contains over one million edges. Note, however, that many of them are duplicates as we looped over all “Fahrten” of a given day. As in Part 1, we will now aggregate all duplicate edges, counting the number of connections and the average travel time between any two nodes.
```python
# Empty dict.
edges = {}

# Loop over elements in edgelist.
for i in edgelist:
    # Create key.
    key = (i[0], i[1])
    # Get previous entries in dict (if there are any).
    prev = edges.get(key, (0, 0))
    # Update values in dict.
    edges[key] = (prev[0] + 1, prev[1] + i[2])

# Divide summed-up travel times by number of trips.
edges = {k: (v[0], round(v[1]/v[0], 2)) for k, v in edges.items()}

print("Number of edges:", len(edges))
```
Number of edges: 37951
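As an aside, the defaultdict imported at the top can express the same count-and-average aggregation a bit more compactly. A sketch on a toy edgelist (hypothetical values, variable names `toy_edgelist` and `edges_toy` are mine):

```python
from collections import defaultdict

# Toy edgelist of (origin, destination, minutes) triples -- hypothetical values.
toy_edgelist = [("A", "B", 10.0), ("A", "B", 12.0), ("B", "C", 5.0)]

# Accumulate (count, total_minutes) per directed edge.
acc = defaultdict(lambda: [0, 0.0])
for origin, dest, minutes in toy_edgelist:
    acc[(origin, dest)][0] += 1
    acc[(origin, dest)][1] += minutes

# Turn the summed travel times into averages, as in the dict-based loop above.
edges_toy = {k: (v[0], round(v[1] / v[0], 2)) for k, v in acc.items()}
print(edges_toy)  # {('A', 'B'): (2, 11.0), ('B', 'C'): (1, 5.0)}
```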
We are left with 37’951 directed and weighted edges that are currently stored in a dict called edges. Let’s see how long it takes to get from Olten to Winterthur and how many connections there are on a given day:
```python
# Test
edges[("Olten", "Winterthur")]
```
(25, 63.84)
I can travel from Olten to Winterthur 25 times per day (without having to change trains) and the trip takes a bit more than an hour.
Now, we import the nodelist from Part 1 so that we can replace the station names in the edges by the BPUIC identifiers.
```python
# Load the nodelist.
nodes = pd.read_csv("nodelist.csv", sep=";")

# Create a node dict with BPUIC as values.
node_dict = dict(zip(nodes.STATION_NAME, nodes.BPUIC))
```
After changing all station names to BPUIC numbers, we create a dataframe that can then be exported as a CSV file. Yay, we're done!
```python
# Transform edge dict to nested list and replace all station names with their BPUIC.
edges = [[node_dict[k[0]], node_dict[k[1]], v[0], v[1]] for k, v in edges.items()]

# Create a dataframe.
edges = pd.DataFrame(edges, columns=['BPUIC1', 'BPUIC2', 'NUM_CONNECTIONS', 'AVG_DURATION'])

# Have a look.
edges.head()

# Export edge list.
# edges.to_csv("edgelist_SoC.csv", sep=';', encoding='utf-8', index=False)
```
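To illustrate the interpretation of shortest paths in this representation (the number of train changes), here is a self-contained sketch on a toy directed graph with hypothetical station IDs, not the exported data:

```python
from collections import deque

# Toy directed space-of-changes edges (hypothetical station IDs).
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]

# Build an adjacency dict.
adj = {}
for a, b in toy_edges:
    adj.setdefault(a, []).append(b)

def min_hops(adj, src, dst):
    """BFS: smallest number of edges needed to get from src to dst (None if unreachable)."""
    seen = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return seen[u]
        for v in adj.get(u, []):
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return None

# In the space-of-changes graph, hops - 1 = number of train changes.
print(min_hops(adj, 1, 4))  # 2: e.g. 1 -> 3 -> 4, i.e. one change
```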
Space-of-stations

For the space-of-stations graph representation we make use of the fact that the space-of-stations graph should be a subgraph of the space-of-stops graph that we extracted in Part 1, with the latter containing additional edges that represent shortcuts. For example, the space-of-stops graph contains a directed edge from Olten to Basel SBB, as there are nonstop trains between these two stations. However, there are also smaller, regional trains which stop at all stations in between. The key idea (also nicely shown by Kurant and Thiran) is to go through all edges in the space-of-stops graph and identify the ones that are shortcuts.
We first load the (space-of-stops) edgelist from Part 1 and add the station names.
```python
# Load the space-of-stops edgelist.
edges = pd.read_csv("edgelist.csv", sep=";")

# Create a node dict with station names as values.
node_dict = dict(zip(nodes.BPUIC, nodes.STATION_NAME))

# Add actual station names.
edges["STATION1"] = [node_dict[v] for v in edges["BPUIC1"]]
edges["STATION2"] = [node_dict[v] for v in edges["BPUIC2"]]

# Check out the dataframe.
print(edges.head())
print("Number of edges in space-of-stops representation:", edges.shape[0])
```
BPUIC1 BPUIC2 NUM_CONNECTIONS AVG_DURATION STATION1 STATION2
0 8503000 8500010 36 54.00 Zürich HB Basel SBB
1 8500010 8500090 67 6.07 Basel SBB Basel Bad Bf
2 8500090 8500010 67 6.39 Basel Bad Bf Basel SBB
3 8500010 8503000 39 57.87 Basel SBB Zürich HB
4 8503424 8500090 17 73.18 Schaffhausen Basel Bad Bf
Number of edges in space-of-stops representation: 4215
For the space-of-stations representation, undirected edges make the most sense. Thus, we need to make the directed edges from the space-of-stops representation undirected and remove all duplicates that this introduces (e.g., ‘Olten - Basel SBB’ and ‘Basel SBB - Olten’). With a little help from ChatGPT I found an elegant solution to achieve just that.
More concretely, we iterate over the zip object containing the node pairs of all edges. The min() and max() functions applied to the station names will sort the station names alphabetically so that, for example, ‘Olten - Basel SBB’ and ‘Basel SBB - Olten’ are both transformed to ‘Basel SBB - Olten’. Finally, the set() function will get rid of all duplicates.
```python
# Get a list of unique undirected edges.
unique_undirected_edges = list(set((min(e1, e2), max(e1, e2)) for e1, e2 in zip(edges["STATION1"], edges["STATION2"])))
print("Number of unique undirected edges:", len(unique_undirected_edges))
```
Number of unique undirected edges: 2154
This step leaves us with 2’154 undirected, unique edges.
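To see the min()/max() trick in isolation, here it is applied to just the example pair from the text; both directions collapse onto the alphabetically sorted tuple:

```python
# Both directed versions of the same connection...
pairs = [("Olten", "Basel SBB"), ("Basel SBB", "Olten")]

# ...collapse onto one alphabetically sorted undirected edge.
undirected = list(set((min(a, b), max(a, b)) for a, b in pairs))
print(undirected)  # [('Basel SBB', 'Olten')]
```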
Data preprocessing for improved efficiency
In order to make the procedure further below more efficient, we extract all unique “Fahrten” here. More specifically, we create a dictionary fahrten with the sequence of station names as key and the FAHRT_BEZEICHNER as value. If a sequence of station names already exists as a key in the dict, the value belonging to that key is overwritten with the new FAHRT_BEZEICHNER. That doesn’t bother us, since we just want to be able to extract one example “Fahrt” per unique sequence of stops.
```python
# Empty dict.
fahrten = {}

# Loop over grouped df.
# If the same key (sequence of stops) reappears, the value will be overwritten.
# But that behavior is desired: we only want to keep one FAHRT_BEZEICHNER per key.
for fahrt, group in df.groupby('FAHRT_BEZEICHNER'):
    fahrten[tuple(group['STATION_NAME'])] = fahrt

print("Number of unique 'Fahrten':", len(fahrten))
print("Number of 'Fahrten' in whole dataframe:", df['FAHRT_BEZEICHNER'].nunique())
```
Number of unique 'Fahrten': 1678
Number of 'Fahrten' in whole dataframe: 14653
We can see from the above output that this step drastically reduces the “Fahrten” that we will iterate over later.
In the following code chunk we filter the “Ist-Daten” (df) loaded earlier so that only the unique “Fahrten” are left.
```python
# Reduce the dataframe to the 'Fahrten' in the list of values of the dict.
df = df[df['FAHRT_BEZEICHNER'].isin(list(fahrten.values()))]
print("Remaining number of rows:", df.shape[0])
```
Remaining number of rows: 17597
Another little trick to make things more efficient later is to create a dictionary with station names as keys and, as values, a list of all FAHRT_BEZEICHNER strings a station name is part of (kind of an inverted index).
```python
# defaultdict with lists.
result_dict = defaultdict(list)

# Iterate over rows.
for _, row in df.iterrows():
    # Fill the dict with stations as keys and FAHRT_BEZEICHNER strings as values.
    result_dict[row['STATION_NAME']].append(row['FAHRT_BEZEICHNER'])

# Convert back to a normal dict.
result_dict = dict(result_dict)
```
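As an aside (not used below), the same inverted index can be built with a single groupby call instead of iterrows(), which is typically much faster; the result is equivalent up to the order of the lists, which doesn't matter here since they are only intersected as sets later. A sketch on a toy stand-in for df (hypothetical values):

```python
import pandas as pd

# Toy stand-in for df with just the two relevant columns (hypothetical values).
toy_df = pd.DataFrame({
    "STATION_NAME": ["A", "B", "A", "C"],
    "FAHRT_BEZEICHNER": ["f1", "f1", "f2", "f2"],
})

# One groupby call instead of a Python-level row loop.
toy_index = toy_df.groupby("STATION_NAME")["FAHRT_BEZEICHNER"].apply(list).to_dict()
print(toy_index)  # {'A': ['f1', 'f2'], 'B': ['f1'], 'C': ['f2']}
```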
Identify shortcuts
Next, we perform the key step in extracting the edges for the space-of-stations representation: we need to identify all edges that are shortcuts, passing train stations without stopping.
We first define a custom function that determines whether any two station names a and b are adjacent in a sequence (list) of station names lst.
```python
# Function to check whether elements a and b are NOT adjacent in lst.
def is_shortcut(lst, a, b):
    return not any((x, y) == (a, b) or (x, y) == (b, a) for x, y in zip(lst, lst[1:]))
```
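A quick sanity check on an abbreviated version of the Yverdon-les-Bains stop sequence from earlier shows how the function behaves:

```python
# The helper from above, repeated so this snippet runs standalone.
def is_shortcut(lst, a, b):
    return not any((x, y) == (a, b) or (x, y) == (b, a) for x, y in zip(lst, lst[1:]))

# Abbreviated sequence of stops from the Yverdon-les-Bains example.
seq = ["Yverdon-les-Bains", "Vuiteboeuf", "Baulmes", "Ste-Croix"]

print(is_shortcut(seq, "Yverdon-les-Bains", "Vuiteboeuf"))  # False: consecutive stops
print(is_shortcut(seq, "Baulmes", "Vuiteboeuf"))            # False: adjacency counts in both directions
print(is_shortcut(seq, "Yverdon-les-Bains", "Baulmes"))     # True: one stop in between
```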
Then, we iterate over all undirected, unique edges that we prepared above. For each edge we go through the following steps:

1. We get the FAHRT_BEZEICHNER strings for all “Fahrten” which both nodes of the edge are part of. For this we use the inverted index-style dictionary we created above.
2. Then we perform an inner loop over the “Fahrten” extracted in the first step:
   - We first extract the sequence of stations of a “Fahrt”.
   - We use our custom function from above to check whether the two nodes are adjacent in the sequence of stations.
   - If they are not adjacent, i.e., the edge represents a shortcut, we save that edge, break the inner loop, and move on to the next edge.
```python
# Empty list for shortcuts.
shortcut_edges = []

# Loop over list of undirected edges.
for idx, edge in enumerate(unique_undirected_edges):
    # Find all 'Fahrten' in which both stations of the edge appear.
    intersection = list(set(result_dict[edge[0]]) & set(result_dict[edge[1]]))
    # Initialize shortcut to False.
    shortcut = False
    # Loop over 'Fahrten' in which both stations of the edge appear.
    for fahrt in intersection:
        # Get the sequence of stations in current 'Fahrt'.
        seq_of_stations = df.loc[df['FAHRT_BEZEICHNER'] == fahrt, 'STATION_NAME'].tolist()
        # Check whether the edge represents a shortcut in that sequence.
        shortcut = is_shortcut(seq_of_stations, edge[0], edge[1])
        # If it is a shortcut, add it to the list and break the inner loop.
        if shortcut:
            shortcut_edges.append((fahrt, edge))
            break

print("Number of shortcut edges:", len(shortcut_edges))
```
Number of shortcut edges: 442
A total of 442 edges are identified as shortcuts. Let’s have a look at the first one:
```python
# Check first shortcut.
print(shortcut_edges[0])

# Check the 'Fahrt' in which it was detected as a shortcut.
df.loc[df['FAHRT_BEZEICHNER'] == shortcut_edges[0][0], 'STATION_NAME']
```
('85:22:2076:000', ('Bühler', 'Gais'))
1236 Gais
1237 Zweibrücken
1238 Strahlholz
1239 Bühler
1240 Steigbach
1241 Teufen AR
1242 Teufen AR Stofel
1243 Sternen bei Teufen
1244 Niederteufen
1245 Lustmühle
1246 St. Gallen Riethüsli
1247 St. Gallen Güterbahnhof
1248 St. Gallen
1249 St. Gallen Marktplatz
1250 St. Gallen Spisertor
1251 St. Gallen Schülerhaus
1252 St. Gallen Birnbäumen
1253 St. Gallen Notkersegg
1254 Schwarzer Bären
1255 Vögelinsegg
1256 Schützengarten
1257 Speicher
1258 Bendlehn
1259 Gfeld
1260 Trogen
Name: STATION_NAME, dtype: object
From the whole sequence of stations, we can see that the edge identified as a shortcut does indeed connect two stations that are not consecutive stops.
Finally, we remove the FAHRT_BEZEICHNER from shortcut_edges and create the final edge list without shortcuts.
```python
# Extract only the edges.
shortcut_edges_clean = [i[1] for i in shortcut_edges]

# Get the final list of non-shortcut edges.
final_edges = [e for e in unique_undirected_edges if e not in shortcut_edges_clean]
print("Number of edges:", len(final_edges))
```
Number of edges: 1712
We have a final number of edges of \(2154-442=1712\).
Validate with “Liniendaten”
The extraction of the edges in the space-of-stations representation was a bit more complex than for space-of-changes or space-of-stops. That’s why I would like to run some checks.
We can validate some of the edges we extracted with another dataset from the Open Data Portal of SBB. The dataset Linie (Betriebspunkte) contains all railway “lines” maintained by SBB with all “Betriebspunkte” (including stations) that are located along these lines. Let’s load this dataset:
```python
# Load the data about "Linien mit Betriebspunkten".
linien = pd.read_csv('linie-mit-betriebspunkten.csv', sep=";")

# Reduce to relevant columns.
linien = linien[["Name Haltestelle", "Linie", "KM", "Linien Text", "BPUIC"]]
print("Shape of dataframe:", linien.shape)
```
Shape of dataframe: (1884, 5)
Let’s have a look:
```python
# Have a look at the dataframe.
linien.head()
```

  Name Haltestelle  Linie         KM                           Linien Text    BPUIC
0            Aarau    649   41.50577           Aarau - Woschnau Tunnel alt  8502113
1          Aarberg    251   95.49304             Palezieux Est - Lyss Nord  8504404
2         Aesch BL    230  113.00006          Delemont Est - Basel SBB Ost  8500117
3           Aespli    455   92.76586                 Unterhard BE - Aespli  8515299
4            Aigle    100   39.31241  Lausanne - Simplon Tunnel I - Iselle  8501400
The rows in that dataset are not just stations but also other “Betriebspunkte” (important locations that are needed to run the infrastructure). But we can identify the stations among the “Betriebspunkte” by joining the nodes dataframe on BPUIC and only keeping the entries for which there was a matching row in nodes.
```python
# Join the rows of the nodelist based on BPUIC.
linien = pd.merge(linien, nodes[["BPUIC", "STATION_NAME"]], on='BPUIC', how='left')

# How many entries have a missing value, i.e., are not stations?
print("Number of non-stations:", linien["STATION_NAME"].isna().sum())

# Drop all rows that are not stations.
linien = linien.dropna(subset=["STATION_NAME"])
print("Number of remaining rows:", linien.shape[0])
```
Number of non-stations: 979
Number of remaining rows: 905
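As an aside, pandas can make this matched/unmatched split explicit via merge(..., indicator=True), which adds a '_merge' column. A sketch on toy stand-ins for linien and nodes (hypothetical values):

```python
import pandas as pd

# Toy stand-ins for 'linien' and 'nodes' (hypothetical values).
toy_linien = pd.DataFrame({'BPUIC': [1, 2, 3], 'KM': [0.5, 1.2, 3.4]})
toy_nodes = pd.DataFrame({'BPUIC': [1, 3], 'STATION_NAME': ['A', 'B']})

# indicator=True adds a '_merge' column flagging the origin of each row.
merged = pd.merge(toy_linien, toy_nodes, on='BPUIC', how='left', indicator=True)

# '_merge' is 'both' for stations and 'left_only' for other Betriebspunkte.
print(merged['_merge'].tolist())  # ['both', 'left_only', 'both']
```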
Next, we group the rows by 'Linie' and sort them in ascending order by 'KM' (where along the line is the “Betriebspunkt” located, in terms of kilometres) so that the stations for each line are sorted in the right order.
```python
# Function to sort entries within a group in ascending order of KM.
def sort_data(group):
    return group.sort_values('KM', ascending=True)

# Sort for each group.
linien_sorted = linien.groupby('Linie', group_keys=False).apply(sort_data)

# Let's have a look at Linie 290.
linien_sorted.loc[linien_sorted['Linie'] == 290]
```
     Name Haltestelle  Linie         KM            Linien Text    BPUIC   STATION_NAME
1351    Ostermundigen    290  110.76500  Bern Wylerfeld - Thun  8507002  Ostermundigen
865          Gumligen    290  113.95844  Bern Wylerfeld - Thun  8507003       Gümligen
270           Rubigen    290  119.03812  Bern Wylerfeld - Thun  8507005        Rubigen
1333        Munsingen    290  122.13149  Bern Wylerfeld - Thun  8507006      Münsingen
695         Wichtrach    290  125.73250  Bern Wylerfeld - Thun  8507007      Wichtrach
892            Kiesen    290  128.30300  Bern Wylerfeld - Thun  8507008         Kiesen
1055          Uttigen    290  131.09513  Bern Wylerfeld - Thun  8507009        Uttigen
We see here for one example (Line 290) that the stations are now nicely sorted in ascending order of 'KM'.
Now, we can create a new column that always contains the station name of the next row using the handy shift() method. The last row within a group will always have a missing value for that new column as there is no next station at the end of a line. So, we drop the last row of each line.
```python
# Create a new column that for each row contains the next stop within the group.
linien_sorted["NEXT_STATION"] = linien_sorted.groupby("Linie")["STATION_NAME"].shift(-1)

# Drop all rows where 'NEXT_STATION' is missing.
linien_sorted = linien_sorted.dropna(subset=["NEXT_STATION"])
```
We extract the values of the last two columns as the edges. Importantly, we now sort the node pairs in each edge in the same way as before (alphabetically).
```python
# Now let's extract the edges.
linien_edges = list(zip(linien_sorted['STATION_NAME'], linien_sorted['NEXT_STATION']))

# Make sure the tuples are arranged in the same way as above (and unique).
linien_edges = list(set((min(e[0], e[1]), max(e[0], e[1])) for e in linien_edges))
```
We first want to check whether there are edges in linien_edges that are neither a shortcut nor in the final edgelist from above.
```python
# Check which edges are in linien_edges but neither in final_edges nor in shortcut_edges_clean.
[x for x in linien_edges if x not in final_edges and x not in shortcut_edges_clean]
```
There are some candidate edges but I checked all of them manually in the train schedule and none of them seem to have direct train connections. It could be that some of these are old train lines that are not active anymore.
Are there any edges in linien_edges that were classified as shortcuts?
```python
# Are there any edges that I classified as shortcuts?
[x for x in linien_edges if x in shortcut_edges_clean]
```
Yes, but these two edges are in fact shortcuts. Between Niederbipp and Oensingen there is a small station called Niederbipp Industrie. Between Chur and Landquart there are several smaller stations. Note, however, that even though the train tracks between Chur and Landquart pass those smaller stations, there may be no infrastructure for trains to actually stop.
A manual check of all 1’712 edges reveals that there are more shortcuts that our procedure was not able to identify. For example, the edge (Bern, Zofingen) cannot be identified because there is no other “Fahrt” that contains these two stations and stops somewhere in between. We manually remove such edges. In addition, we add some edges for which I know that there is actually infrastructure (tunnels, high-speed routes) that directly connects the two nodes involved.
```python
# Manually remove edges.
final_edges.remove(('Bern', 'Zofingen'))
final_edges.remove(('Bern Wankdorf', 'Zürich HB'))
final_edges.remove(('Basel Bad Bf', 'Schaffhausen'))

# Manually add edges.
final_edges.append(('Biasca', 'Erstfeld'))
# final_edges.append(('Frutigen', 'Visp'))  # Seems to be already in there.
final_edges.append(('Bern Wankdorf', 'Rothrist'))
```
Export the edgelist
Finally, we can export the edges as before for the other representations.
```python
# Create a node dict with BPUIC as values.
node_dict = dict(zip(nodes.STATION_NAME, nodes.BPUIC))

# Transform the edge list and replace all station names with their BPUIC.
edges = [[node_dict[e[0]], node_dict[e[1]]] for e in final_edges]

# Create a dataframe.
edges = pd.DataFrame(edges, columns=['BPUIC1', 'BPUIC2'])

# Have a look.
edges.head()

# Export edge list.
# edges.to_csv("edgelist_SoS.csv", sep=';', encoding='utf-8', index=False)
```
Kurant, M., & Thiran, P. (2006). Extraction and analysis of traffic and topologies of transportation networks. Physical Review E, 74(3), 036114. https://doi.org/10.1103/PhysRevE.74.036114
The title image has been created by Wikimedia user JoachimKohler-HB and is licensed under Creative Commons.