Behind the scenes with MBTA data.

We are excited to announce our new Open Data Portal!

The MBTA has published the MBTA Ridership and Service Statistics, also known as “The Blue Book,” since 1988. According to the 2005 Blue Book:

“The MBTA receives frequent inquiries from customers, students, peer transit providers, government agencies, community organizations, transportation enthusiasts, and the media for information regarding its operations, and this book is intended to address these needs. Additionally, this book serves as a management and analytical tool for MBTA staff.”  

The most recent Blue Book edition was released in 2014 and contained data on ridership, bus speeds, track distances, fleet rosters, and more. However, the Blue Book has largely lacked consistency in updates. Moreover, because of the Blue Book’s print format, published data has only existed in an aggregated, non-interactive form, and has failed to include extensive historical data.

As demand for more up-to-date data increases, and to address the shortcomings of previous Blue Books, the offices of Performance Management & Innovation and Transportation Planning have worked to create an online open data portal. The portal, which went public on Monday, October 7, is designed to be easily navigable and searchable by mode type and data category. Many of the datasets on the portal had previously been available on other platforms, such as our performance API, though users will now have the ability to download customized datasets using an in-program filtering option. Metrics on reliability and performance are available, as well as historical data about the MBTA’s financials, assets, and system information. New data will be added in the future.

Though we will no longer be publishing new Blue Book versions, the MBTA Open Data Portal provides the same service. The portal allows users to download data directly from the site and view reported figures within applications and maps on the portal. Datasets that are GTFS-compatible are marked as such and can be downloaded for outside visual development. Data exists on the portal in its non-aggregated form to increase options for the user, but can be aggregated by the user upon download to mimic the reporting of the previous Blue Book editions.
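As a rough illustration of aggregating a downloaded file yourself, here is a minimal Python sketch. The column names (mode, route, boardings) and the toy rows are made up for the example and are not taken from any actual portal dataset; a real download would have its own fields, described in its data dictionary.

```python
import csv
import io
from collections import defaultdict


def aggregate(rows, group_by, value_field):
    """Sum value_field for each distinct value of group_by."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_by]] += float(row[value_field])
    return dict(totals)


# Toy rows standing in for a downloaded CSV (hypothetical columns, not real figures).
SAMPLE_CSV = """mode,route,boardings
bus,1,10
bus,7,5
rapid transit,Red,20
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
totals = aggregate(rows, group_by="mode", value_field="boardings")
```

With a real download, the in-memory sample would be replaced by the file opened with open(path, newline="").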

How To Use

View all datasets with mode (rapid transit, bus, etc.) or category (ridership, performance, etc.) tags by clicking on the icons under the mode and category headers. Alternatively, you can explore all public MBTA datasets by searching for keywords in the dataset title, summary, tags, or description using the search bar at the top of the page. Clicking inside the search bar and pressing enter without entering text will list all public datasets within the portal. Under the “Overview” tab of a dataset, you can view the description, data dictionary, data limitations, attributes, related data, and metadata. The download and API buttons on the right side of the page allow for download format selection. Under the “Data” tab, you can sort and filter the records by any of the attributes and then download only those filtered records. The “API Explorer” tab shows the query functionality and query URL for the particular dataset.

In our previous post about passenger walk distances, we used the Rider Census to examine how accessible transit is to its users and found that passengers walked further than the assumed half-mile to stations at the ends of the Red and Orange Lines, while they walked less than this to stations in the center of our region. Our main conclusion, which is perhaps obvious, was that the structure of the network itself has a large impact on how passengers interact with the network.

We wanted to use this data set to look at passengers’ entire journeys rather than just their access point. To do so, we developed a metric we call “substitution propensity.” In a transportation network, each station is only attractive for a set number of destinations. For example, Savin Hill is a station on the Red Line, so Savin Hill is useful for trips north to downtown Boston. However, for trips west to Ruggles or Dudley Square, Savin Hill is not as useful; it’s likely that people would walk to the nearby stop for the 15 bus instead. In other cases, two nearby stations might serve very similar journeys: for example, much of the E branch of the Green Line and the Orange Line run nearly parallel to each other.

Substitution, as it relates to walkability, is defined here as the propensity with which passengers exclusively choose a particular route over other nearby alternative routes. Substitution explains differences in how passengers choose to access MBTA services: passengers will walk for longer distances in areas in which there are fewer service options. This is also a useful metric for determining what qualities passengers value in MBTA services. For example, there may be situations in which bus routes are not substituted for rail routes even when the bus route is faster, because passengers may value frequency over faster travel times.


To measure substitution, we used the 2015-2017 Rider Census data, which includes information about the most recent journey survey respondents took using the MBTA system. We categorized each journey by its starting mode, or the type of service used at the start of the respondent’s journey, and its ending mode, or the type of service used at the end of the respondent’s journey. We defined four categories for the starting and ending modes: commuter rail, bus, light rail (the Green and Mattapan Lines), and heavy rail (the Red, Orange, and Blue Lines). This resulted in each journey being assigned to one of 16 categories. To give an example, for a passenger who begins their journey at Lynn, takes the commuter rail to North Station, transfers to the Green Line and finishes their journey at Prudential, the journey would be classified as “Commuter Rail to Light Rail.”
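The categorization described above can be sketched in a few lines of Python. This is an illustration only: the SERVICE_MODE lookup and the function name are hypothetical, and the real survey data presumably encodes services differently.

```python
from itertools import product

MODES = ("Commuter Rail", "Bus", "Light Rail", "Heavy Rail")

# Hypothetical lookup from a service name to its mode category.
SERVICE_MODE = {
    "Fitchburg Line": "Commuter Rail",
    "Route 7": "Bus",
    "Green Line": "Light Rail",
    "Mattapan Line": "Light Rail",
    "Red Line": "Heavy Rail",
    "Orange Line": "Heavy Rail",
    "Blue Line": "Heavy Rail",
}

# Four starting modes times four ending modes gives the 16 categories.
ALL_CATEGORIES = [f"{start} to {end}" for start, end in product(MODES, MODES)]


def classify_journey(services):
    """Label a journey by the mode of its first and last service."""
    start = SERVICE_MODE[services[0]]
    end = SERVICE_MODE[services[-1]]
    return f"{start} to {end}"
```

A journey that uses a single service falls into a same-mode category, for example “Heavy Rail to Heavy Rail” for a trip entirely on the Red Line.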

While the survey data provided helpful insights on clustering and completed journeys, we had to account for undersampled evening commutes in the data set. We assumed that the trips from point A to point B by morning commuters are duplicated as trips from point B to point A by those same commuters in the evening, assuming that passengers use the same MBTA service for both commutes.
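One way to picture this adjustment is the short Python sketch below. The dictionary keys are hypothetical, and swapping the starting and ending modes on the mirrored trip is our reading of the "same MBTA service for both commutes" assumption rather than a documented step.

```python
def add_return_trips(journeys):
    """Return the journeys plus a mirrored copy of each one.

    Each journey is a dict with the hypothetical keys "origin",
    "destination", "start_mode", and "end_mode".
    """
    mirrored = [
        {
            **journey,
            "origin": journey["destination"],
            "destination": journey["origin"],
            # Assumption: the return trip uses the same services in reverse order.
            "start_mode": journey["end_mode"],
            "end_mode": journey["start_mode"],
        }
        for journey in journeys
    ]
    return list(journeys) + mirrored
```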

We then used the k-nearest-neighbors algorithm to select, for each journey in the 2015-17 Rider Census, the ten most similar origin-destination pairs. We determined similarity on the basis of a passenger’s origin and destination locations. The origin location is the latitude-longitude coordinates of the street intersection nearest to the passenger’s home, and the destination location is the latitude-longitude coordinates of the street intersection nearest to their workplace. The ten most similar journeys were determined using Euclidean distance in four dimensions: the longitude and latitude of the passenger’s origin point and the longitude and latitude of the passenger’s destination point. We calculated the percentage of the ten most similar journeys that belonged to the same category. That measure is the propensity for substitution. For similar origin-destination pairs, if passengers’ journeys vary greatly, the percentage approaches 0%. If journeys do not vary, the percentage approaches 100%.
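To make the calculation concrete, here is a minimal brute-force sketch in Python. It scores each journey by the percentage of its k nearest journeys that share its category, using plain Euclidean distance on the four coordinates. The function name and tuple layout are ours, and treating degrees of latitude and longitude as equal units is a simplification; this is not necessarily how the original analysis was implemented.

```python
import math


def substitution_propensity(journeys, k=10):
    """Percent of each journey's k nearest neighbors in the same category.

    journeys: list of (origin_lat, origin_lon, dest_lat, dest_lon, category).
    Returns one percentage per journey, or None if it has no neighbors.
    """
    scores = []
    for i, (*point, category) in enumerate(journeys):
        # Distance from this journey to every other journey in 4-D space.
        others = [
            (math.dist(point, other[:4]), other[4])
            for j, other in enumerate(journeys)
            if j != i
        ]
        others.sort(key=lambda pair: pair[0])
        nearest = others[:k]
        if not nearest:
            scores.append(None)
            continue
        same = sum(1 for _, cat in nearest if cat == category)
        scores.append(100.0 * same / len(nearest))
    return scores
```

The brute-force search compares every pair of journeys, which is fine for a sketch; a spatial index would be the usual choice for a full survey data set.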

Next, we mapped the substitution metric in QGIS. The survey data was converted to a spatial point data set, with the location of the point determined by the latitude-longitude coordinate of the origin location. We duplicated the survey data while reversing the origin locations and the destination locations, effectively mapping every journey as two points: one representing the origin location and the other representing the destination location. Adjacent points were grouped into 500m hexagons, and the average propensity for substitution was calculated for each hexagon. At 100%, the ten nearest neighbors of journeys that started and ended in that hexagon were, on average, taken using the same MBTA service. Alternatively, at 0%, the ten nearest neighbors of journeys that started and ended in that hexagon were taken using different MBTA services.



A few interesting trends are shown in the substitution map above. Immediately beyond the terminal stations of the Red and Orange Lines, the metric approaches 0%; this is probably because some passengers choose to walk to the Red and Orange Line stations, while other passengers choose to take a bus. Near terminal stations that have large average walk distances, many passengers choose to take other MBTA services rather than walk, since not everyone finds such a walk acceptable. Another interesting observation is that substitution near Andrew and Broadway, the two Red Line stations that serve South Boston, is relatively low; this is most likely because passengers are choosing to take one of the many bus routes rather than the Red Line. In fact, the eastern half of South Boston has a cluster of hexagons with percentages over 80%, meaning that the bus route is practical enough that passengers forgo the walk to Broadway or Andrew.

To illustrate the usefulness of this approach, we conducted an analysis focused specifically on South Boston. Five bus lines converge on City Point at the edge of South Boston: Routes 5, 7, 9, 10, and 11. We filtered the survey data to identify trips that started or ended with one of those bus lines (n=696), and since the survey data is biased towards morning trips, duplicated the survey data while flipping the starting and ending locations. We then applied the same k-nearest-neighbors algorithm to the data, and mapped the data using the same procedure. The resulting data showed the same cluster around City Point where all five of the bus lines converge.

Subsequently, we grouped the individual points into twenty clusters using the k-nearest neighbor algorithm. The four variables we used to cluster the data were the origin location latitude, origin location longitude, destination location latitude, and destination location longitude. We filtered out the clusters with fewer than 20 data points, leaving twelve clusters, which enabled us to identify unusual trip patterns and ignore them. For each usable cluster, we calculated the average substitution percent and plotted the clusters as lines, with the endpoints of the lines representing the average origin and destination locations of passenger journeys in that particular cluster.

The resulting map illustrates that passengers using the bus network whose journeys start or end near the western portion of South Boston typically use the same bus route. Passengers whose journeys begin near Andrew or Broadway, however, use different bus routes to make the same journey. This is potentially a sign that some of the bus routes in South Boston could be consolidated without substantially impacting passenger experience.


In the last two posts, we have used the Rider Census data set to examine how people access transit in greater detail than is usually possible. First, we found that the distance traveled to access transit on foot varies much more than the commonly applied rule of thumb of ½ mile. In this post, we found that people, perhaps unsurprisingly, use different transit services when they have multiple options. Importantly, we do not know from this analysis if an individual might choose different services on different days, nor the reasons why they might choose one service over another. Future analysis can examine these questions, using this and other survey data.

To use the MBTA, passengers typically have to walk, drive, or otherwise travel between our stations and their homes, offices, and schools. The question of how passengers travel between stations and their ultimate origin or destination is called the “last mile problem.” Typically, when the MBTA tries to answer questions involving the last mile problem (e.g., determining how many jobs are within walking distance of T stations), we assume that passengers won’t walk more than half a mile. However, studies of walking distances of different subway networks have found that walk distances vary considerably from station to station. In this blog post, we are going to explore how walk distances may vary from station to station in our MBTA network. 

For this post, we’re using survey answers from our most recent Rider Census, where passengers were asked to provide information about their most recent trip on the MBTA, including the location of their origin and destination. This provides us an opportunity to calculate how far passengers walk between their ultimate origins and destinations and MBTA stations. For each rail station and Silver Line station, as well as for each bus line, we used bootstrapping to calculate a confidence interval for the average distance passengers walk to and from MBTA stops. We then focused our analysis on the Red and Orange Lines, and identified three interesting trends: passengers walked longer distances to reach stations at the ends of the Red and Orange Lines, passengers walked shorter distances to stations constrained by bodies of water, and passengers walked shorter distances to stations in the middle of the Orange Line. 

Methodology and Data Sources

As mentioned, the MBTA and CTPS recently conducted a systemwide passenger survey. For the survey, we asked passengers about their most recent trip on the MBTA. The survey asked passengers to list their origin and destination locations: where they are coming from before arriving at an MBTA stop/station and where they are traveling to after completing their trip on the MBTA. They were able to classify these locations in a variety of ways, like home, workplace, school, etc. The survey then asked passengers to list their mode of travel (driving, walking, biking, or use of a non-MBTA service) when going to and from the T in order to learn more about this “last mile.” Passengers listed the specific MBTA service they used (e.g. Green Line, bus route 7, Fitchburg Line, etc.) and at what specific stops they boarded and alighted. Passengers also provided basic demographic information.

Not every respondent provided an origin or destination location, so we separated the dataset into two groups: responses that included an origin location, and responses that included a destination location. (Responses that included both an origin and destination location were counted twice.) Since we are investigating walkability, we filtered the datasets so that they only contained responses from passengers who identified their access and egress modes as walking. This left 15,934 responses from passengers who identified an origin location and walked to their first MBTA boarding and 18,161 responses from passengers who identified a destination location and walked from their last MBTA alighting.

For each of the responses, we calculated the walk distance as the straight-line distance in meters from the origin or destination location to the location where the passenger boarded their first MBTA service or exited their last one. In cases where passengers were using rail or Silver Line service, the survey identified the exact stop at which passengers boarded and exited the service. However, in cases where passengers were using bus service, the survey did not identify the exact stop at which passengers boarded and exited; the survey only identified the bus line that passengers took. Therefore, we assumed that bus passengers would walk to the bus stop closest to their origin or destination location, and used the bus stop nearest to the passenger’s origin or destination location to calculate the walk distance.
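As an illustration of this step, the Python sketch below computes a straight-line distance in meters between two latitude-longitude points and picks the nearest stop from a list. Using the haversine (great-circle) formula is our assumption; the post does not say which distance formula or map projection was used, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_stop_distance_m(location, stops):
    """Distance from location to the closest stop; all points are (lat, lon)."""
    lat, lon = location
    return min(haversine_m(lat, lon, s_lat, s_lon) for s_lat, s_lon in stops)
```

For rail and Silver Line trips, haversine_m would be applied to the known boarding or alighting stop; for bus trips, nearest_stop_distance_m mirrors the nearest-stop assumption described above.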

Finally, we filtered out stops and bus lines that had fewer than thirty data points. Few Green and Blue Line stations had more than thirty data points, whereas the Red and Orange Line stations all had more than thirty data points each. Therefore, we decided to focus on the Red and Orange Lines for the purposes of this blog post. We mapped the mean and median walk distances for the Red and Orange Lines in QGIS (we did not map the walk distances for Downtown Crossing, as that station is shared by the Red and the Orange Line).
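The bootstrapped confidence intervals and the thirty-point filter can be sketched as follows in Python. This uses a simple percentile bootstrap with 2,000 resamples and a 95% interval; those choices, along with the function names, are assumptions for illustration, since the post does not specify the bootstrap variant or settings.

```python
import random
import statistics

MIN_POINTS = 30  # stations with fewer responses are dropped


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def station_intervals(distances_by_station):
    """Mean and median intervals for every station with enough data points."""
    return {
        station: {
            "mean_ci": bootstrap_ci(distances, statistics.mean),
            "median_ci": bootstrap_ci(distances, statistics.median),
        }
        for station, distances in distances_by_station.items()
        if len(distances) >= MIN_POINTS
    }
```

Because the median of a resample is always one of the observed values (or the midpoint of two), bootstrapped median intervals can coincide with the median itself.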

Possible limitations of the data include:

  • The number of responses for each station and line is not proportional to the ridership of the respective stations/lines.
  • Women, English speakers, and regular MBTA passengers were more likely to respond to the survey.
  • Because passengers were asked to describe their most recent trip, the survey responses were often biased towards trips taken in the morning.


Line         Station            Datapoints  Mean Walk Distance (m)  Mean Lower CI  Mean Upper CI  Median Walk Distance (m)  Median Lower CI  Median Upper CI
Orange Line  Assembly Station   57          513.657                 437.840        587.424        320.230                   200.572          320.230
Orange Line  Back Bay           388         497.198                 359.258        603.427        331.830                   311.550          333.988
Orange Line  Downtown Crossing  387         462.723                 366.399        545.144        301.413                   286.933          330.088
Orange Line  Forest Hills       119         712.659                 578.353        839.518        429.573                   325.896          507.278
Orange Line  Malden Center      151         718.944                 527.984        852.199        561.643                   535.360          579.449
Orange Line  Mass Ave           221         466.500                 382.312        532.923        261.808                   199.563          261.808
Orange Line  North Station      303         558.156                 299.258        709.093        272.793                   234.126          272.793
Orange Line  Oak Grove          77          1142.579                778.101        1446.468       716.906                   647.668          904.608
Orange Line  Sullivan Square    90          674.677                 547.066        786.468        432.475                   334.869          433.871
Red Line     Alewife            188         832.947                 647.254        980.625        658.636                   558.152          727.017
Red Line     Charles MGH        934         261.047                 238.093        280.972        175.378                   140.958          175.378
Red Line     Davis Square       462         787.405                 493.758        961.490        585.864                   544.225          650.085
Red Line     Downtown Crossing  501         526.546                 412.159        621.614        306.902                   290.354          339.374
Red Line     Kendall Square     1218        421.666                 400.187        441.639        315.829                   315.829          315.829
Red Line     South Station      684         424.869                 354.815        484.151        264.058                   219.488          282.041
Red Line South Station 684 424.869 354.815 484.151 264.058 219.488 282.041


Confidence Interval = CI

* Click the link to view the above table with more stations listed.

An image of the mean walk distances for the Red and Orange Lines.

An image of the median walk distances for the Red and Orange Lines.


There are a number of interesting conclusions that can be drawn from the mean and median walk distances from each station. We have tried classifying them into a few main trends as explained below:

Physical Landscape — Safety & Geography

The Charles MGH station is notable for having substantially lower median and mean walk distances than the other Red Line stations. There are a few possible explanations for this. First, the built environment of Charles MGH is particularly inconvenient to pedestrians: the station has only two crosswalks and two entrances, and is surrounded by busy roads. Pedestrians are also constrained by two geographic features, Beacon Hill (the hill, not the neighborhood) and the Charles River, both of which could limit how far pedestrians are able to walk to reach Charles MGH. Kendall Square, which has the fourth-lowest mean walking distance of all Red Line stations, is also adjacent to the Charles River, which provides further evidence that bodies of water like the Charles River affect station walkability. A similar effect can be seen at Assembly Station on the Orange Line, which is surrounded by the Mystic River and Interstate 93, and has a lower average and median walkshed than the adjacent stations (Sullivan Square and Malden).

Last Stations on Subway Lines

Stations at the ends of the Red and Orange Lines—Alewife, Davis, Forest Hills, Malden, and Oak Grove—tend to have larger average and median walk distances. This is probably because passengers who live beyond the reach of the Red and Orange Lines prefer the Red and Orange Lines to alternative MBTA services (the bus network and the commuter rail), and are willing to walk further distances to reach the Red and Orange Lines. 


The stations at the center of the Orange Line, from around Mass Ave to around North Station, tend to have lower medians and means compared to other Orange Line stations. There are a few possible explanations for this. First, this section of the Orange Line is not only very close to the E branch of the Green Line, but also runs parallel to it. This means that passengers can choose between the E branch and the Orange Line, and it’s likely that one of the factors that goes into that decision is which line has the closest station, so passengers are likely minimizing their walking distances along that section of the Orange Line. Another factor is that stations in the center of the Orange Line are particularly close together, which could affect how far passengers need to walk to reach an Orange Line station.