Behind the scenes with MBTA data.

The MBTA is developing a new Service Plan focused on improving our bus routes to better meet our customers’ needs. For the first time, we have much better data about how passengers use the bus and rapid transit network, thanks to our Origin Destination Inference model (ODx). (More information about this model can be found in this earlier post.)

This model gives us a wealth of data on how people use our system, but to make it useful for service planning decisions we have to figure out how to analyze and visualize the data. We want to understand Origin and Destination (OD) patterns so we can find popular ODs that aren’t well served and analyze the impact of proposed changes. One important consideration is the level of geographic aggregation we use to examine the data.

The most disaggregate level of OD analysis treats each individual bus stop and station as a point. This can be useful for analyzing small changes to bus routes. But at the network level we want to see patterns across larger geographic areas. We don’t want to take our existing bus routes and stops as given; instead, we want to see whether many people currently travel from one neighborhood to another area of the region.

Planners and transportation engineers use a number of different geographic units for analyzing spatial data. Census data comes associated with block groups and tracts. Transportation modelers use Traffic Analysis Zones (TAZs) to analyze OD flows. Using these existing spatial units allows us to compare our OD patterns to other data sets, but it comes with a key problem.

The challenge is that the borders of TAZs tend to be major streets, the type of streets where we have bus stops and rail stations. To assign a trip origin or destination to a specific TAZ, we need to estimate which side of the street a trip started or ended on, or even whether it started in a neighboring TAZ. To address the challenges of assigning OD data to spatial units, we’ve come up with two solutions.

The first approach is to allocate trip data using the straight-line distance from bus stops or stations to TAZ centroids. The closer the centroid is to the stop or station, the greater the trip allocation. But rather than using the geographic centroid of each TAZ, we recalculated the centroids using census population data by block and the corresponding jobs information drawn from Longitudinal Employer-Household Dynamics (LEHD) Workplace Area Characteristics (WAC) files.
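As a rough illustration of what a population-and-jobs-weighted centroid means, here is a minimal Python sketch. It assumes each census block is reduced to a single point with projected x/y coordinates (in meters, say), and it weights each block by population plus jobs. The function name, the equal weighting of residents and jobs, and the sample blocks are all hypothetical, not taken from our actual workflow.

```python
def weighted_mean_center(blocks):
    """Return the (x, y) mean center of blocks weighted by population + jobs.

    `blocks` is an iterable of (x, y, population, jobs) tuples, one per
    census block, with x and y in a projected coordinate system.
    ASSUMPTION: residents and jobs count equally toward the weight.
    """
    total_weight = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for x, y, population, jobs in blocks:
        weight = population + jobs
        total_weight += weight
        sum_x += x * weight
        sum_y += y * weight
    if total_weight == 0:
        raise ValueError("no population or jobs in this zone")
    return sum_x / total_weight, sum_y / total_weight


# Hypothetical blocks inside one TAZ: (x, y, population, jobs)
blocks = [
    (0.0, 0.0, 100, 0),
    (100.0, 0.0, 0, 300),
    (0.0, 200.0, 50, 50),
]
center = weighted_mean_center(blocks)
```

In this made-up zone the center is pulled toward the block with 300 jobs rather than sitting at the geometric middle.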

We tallied jobs and population for each TAZ and used the Mean Center tool in ArcMap to find the new center of concentration. This tool finds the gravity-based center of the TAZ: the average location of its blocks, weighted by their combined jobs and population. From there, we calculated and ranked the distance from each bus stop or station to every new centroid within a half-mile. Finally, we expressed each stop-centroid distance as a proportion of the sum of all stop-centroid distances for that stop and used those proportions to allocate the stop's trip origins and destinations to the corresponding TAZs. The table below shows an example of what the final allocation table might look like. If we applied the proportions shown in the “Allocation” column to X trip destinations at Ashmont, the TAZ in the first row (13035139) would get about 50 percent of them.

IN_FID  Near Distance (m)  Stop Name  TAZ ID    Allocation
1       298.003            Ashmont    13035139  0.507
1       399.418            Ashmont    13035139  0.363
1       674.997            Ashmont    13035139  0.116
1       788.712            Ashmont    13035139  0.014
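To make the allocation step concrete, here is a small Python sketch that turns stop-to-centroid distances into shares and applies them to a trip count. It assumes simple inverse-distance weighting (closer centroid, larger share). That is a stand-in chosen for illustration rather than the exact formula behind the “Allocation” column, so the shares it produces will not match the table. The zone labels and the trip count are hypothetical; the distances are the Ashmont values from the table above.

```python
def allocation_shares(distances_m):
    """Turn stop-to-centroid distances into shares that sum to 1.

    ASSUMPTION: inverse-distance weighting. The exact weighting used to
    build the table is not reproduced here.
    """
    if any(d <= 0 for d in distances_m.values()):
        raise ValueError("distances must be positive")
    inverse = {zone: 1.0 / d for zone, d in distances_m.items()}
    total = sum(inverse.values())
    return {zone: weight / total for zone, weight in inverse.items()}


def allocate_trips(trips, shares):
    """Split a stop's trip count across zones according to the shares."""
    return {zone: trips * share for zone, share in shares.items()}


# Hypothetical zone labels; distances (meters) from the Ashmont rows.
distances = {
    "zone_a": 298.003,
    "zone_b": 399.418,
    "zone_c": 674.997,
    "zone_d": 788.712,
}
shares = allocation_shares(distances)
allocated = allocate_trips(1000, shares)  # 1000 is a made-up trip count
```

Whatever weighting is used, the important properties are the same: the shares for a stop sum to 1, so no trips are lost or double-counted, and nearer centroids receive more trips than farther ones.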

 

Our second approach was to create custom geographies in which the bus stops or stations were very clearly contained. This is where it got fun… and tricky. We needed a method that partitioned the Boston area into geographical regions that made sense for bus service planning and that placed every bus stop and station inside a region. The goal was to create logical groups of bus stops based on how frequently the stops were used. (We also considered using schedule information, but ultimately concluded that grouping stops by actual utilization made the most sense.)

The final method combines a partitioning algorithm called k-means with multiple GIS steps. K-means divides data into a predetermined number of clusters. The first step is to choose the number (n) of desired clusters; we chose 200. Next, the Python code runs through multiple iterations to determine the ideal location of the centroids relative to the rest of the observations. In each iteration, every data point is assigned to its closest centroid by Euclidean distance, and each centroid is then recalculated as the mean of the coordinates of all points assigned to its cluster. This process continues until the centroid locations no longer change from one iteration to the next. Here’s an artsy (and somewhat hypnotizing) visualization of how k-means works. And if you’d like to read about the work that inspired our decision to use the k-means algorithm, check out MIT Ph.D. student Cecilia Viggiano’s dissertation on bus network sketch planning in London.
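The assign-then-recalculate loop described above can be sketched in a few lines of plain Python. This is a minimal illustration of standard k-means (Lloyd's algorithm) rather than our production code. It assumes stops are given as projected (x, y) points, picks the starting centroids at random from the observations, and does not weight stops by ridership. The function names, parameters, and sample points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two (x, y) points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster (x, y) points into k groups; return (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the observations
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels


# Hypothetical stop locations forming two obvious groups.
stops = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(stops, k=2)
```

With 200 clusters and thousands of stops, a library implementation would be the more practical choice; the loop here is only meant to show the two alternating steps and the stopping rule.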

The output table is a list of 200 centroids and their respective latitudes and longitudes. We imported this file into ArcMap and created 200 corresponding polygons using a couple of different Cost-Benefit geoprocessing tools.

We had to attempt this several times because we ran into an additional issue that is particularly challenging in Boston. If we used the tool as-is, we got some polygons that stretched over bodies of water, which is usually unrealistic. Stops on either side of Boston Harbor or the Mystic River should not be considered part of the same “cluster.” We adjusted the methods to account for bodies of water, but we are still considering whether, in some locations with a bridge, a polygon that crosses the Charles (for example, connecting Harvard’s campus in Lower Allston with Harvard Square itself) could make sense. We’re still finalizing the process and fine-tuning the GIS techniques.

Figure: Draft of the k-means cluster process

In the above example, BU East and Cambridgeport are separate clusters, but it's possible that the far south portion of Cambridgeport may "belong" with the other side of the river.

The final product will be two GIS shapefile layers that can be used to aggregate the origins and destinations of trips and visually display OD patterns. The gravity-weighted TAZ file will be used to compare transit OD patterns with the OD patterns from the regional travel demand model. The k-means cluster file will provide a way to look at travel patterns between neighborhoods.
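Once every stop is tied to a zone, aggregating stop-level OD flows to zone-level flows is a straightforward roll-up. The Python sketch below shows the idea, assuming the flows are available as stop-pair trip counts and each stop belongs to exactly one zone (as with the k-means clusters). The stop names, zone names, and trip counts are hypothetical.

```python
from collections import Counter


def aggregate_od(stop_flows, stop_to_zone):
    """Sum stop-to-stop trip counts into zone-to-zone trip counts."""
    zone_flows = Counter()
    for (origin_stop, dest_stop), trips in stop_flows.items():
        zone_pair = (stop_to_zone[origin_stop], stop_to_zone[dest_stop])
        zone_flows[zone_pair] += trips
    return zone_flows


# Hypothetical lookup of stops to zones, and stop-pair trip counts.
stop_to_zone = {"stop_1": "zone_A", "stop_2": "zone_A", "stop_3": "zone_B"}
stop_flows = {
    ("stop_1", "stop_3"): 120,
    ("stop_2", "stop_3"): 80,
    ("stop_3", "stop_1"): 95,
    ("stop_1", "stop_2"): 10,
}
zone_flows = aggregate_od(stop_flows, stop_to_zone)
```

Trips that start and end in the same zone (zone_A to zone_A here) stay in the result as their own pair, which makes it easy to separate local travel from travel between neighborhoods.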