Behind the scenes with MBTA data.

The MBTA is constantly working to improve its data quality, especially the data generated by our train tracking systems, which feeds our customer-facing data feeds. Better data quality means more accurate real-time customer information and performance measurements that more accurately reflect passenger experiences. It also means that our performance measures will show discontinuities caused by improvements to the underlying data rather than by changes in performance. This post explains the changes made to subway data on September 12, 2018 that affect our performance measures.

In previous posts, we’ve explained how the MBTA tracks vehicles both in general and on the Green Line. We have vehicle tracking systems on almost all our vehicles, with different tracking systems for the different modes (heavy rail, light rail, bus, and commuter rail). These vehicle tracking systems produce real-time data feeds (some built by vendors, some built in-house) that are used to manage our service, measure our performance, view vehicle locations in real-time, and provide passengers with predictions of upcoming vehicle arrivals. We use a data fusion engine to combine the data feeds from each of these systems into one consolidated real-time feed to make it easier for our developers to work with our data. This consolidated feed is also the source of data for our performance tracking system that provides the data published on the MBTA Back on Track dashboard.

The existing software that produced the real-time data feed for heavy and light rail vehicles was a legacy codebase that was built in-house. It was functional for the basic application of providing subway predictions, but design decisions made during the initial development made it difficult or impossible to add new features or improve existing data quality. We have been working to replace the software in order to add new features and improve the accuracy of our locations and predictions. We went live with the Green Line portion of the update on February 8, 2018 and went live with the software for heavy rail (Red, Orange, and Blue lines) on September 12, 2018.

Some of the new and improved features include:

  • Inclusion of location information for trains at terminal stations
  • The flexibility to handle different types of shuttle-bus diversions, including ones that are created on the fly in response to incidents
  • Improvements to the accuracy of predictions for trains that are at terminal stations
  • General improvements to the accuracy of locations and predictions throughout the lines

Our previous heavy rail data feed did not include location information for trains at terminal stations, so the passenger-weighted metrics did not take into account passengers traveling to or from the end of the line. Now that the heavy rail feed includes the locations of trains at terminal stations, we have accurate arrival times at those stations and can include these passengers in our metrics. Passenger-weighted reliability metrics for the Red, Orange, and Blue lines will therefore more accurately reflect the customer experience. This will result in a decrease of 0-2% in the reliability metrics for the heavy rail lines, depending on the line and the day.
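To make the effect of including terminal-station passengers concrete, here is a minimal sketch of a passenger-weighted average. It is an illustration only, not our production calculation: the `StopSample` type, the function name, the station labels, and every number in it are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class StopSample:
    stop: str             # station label (illustrative)
    passengers: float     # estimated passengers at this stop (assumed input)
    on_time_share: float  # share of those passengers counted as reliably served, 0..1
    is_terminal: bool


def passenger_weighted_reliability(samples, include_terminals=True):
    """Passenger-weighted average of the per-stop on-time shares."""
    used = [s for s in samples if include_terminals or not s.is_terminal]
    total = sum(s.passengers for s in used)
    if total == 0:
        raise ValueError("no passengers in the selected samples")
    return sum(s.passengers * s.on_time_share for s in used) / total


# Made-up numbers, chosen only to show the direction of the change.
samples = [
    StopSample("Mid-line A", 600, 0.92, False),
    StopSample("Mid-line B", 300, 0.90, False),
    StopSample("Terminal", 100, 0.80, True),
]

before = passenger_weighted_reliability(samples, include_terminals=False)
after = passenger_weighted_reliability(samples, include_terminals=True)
print(f"without terminals: {before:.3f}, with terminals: {after:.3f}")
```

In this toy example the metric drops once the terminal passengers are counted, because their on-time share is lower than the rest of the line's. The real size of the shift depends on actual ridership and service at each terminal.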

In addition to the new data feed, we have built a new data fusion engine called Concentrate. It combines the new real-time data feed for heavy rail and light rail with the feeds for commuter rail and bus into one consolidated feed for all modes. Concentrate enables higher-capacity, more frequent sharing of all MBTA real-time data. It went live in March 2018 as the source of real-time information for third-party developers and customers, and over the last few months we have been rolling it out as a data source for internal systems. We began using the data from Concentrate for the performance tracking system on September 12, 2018. It shortens the interval between real-time updates by up to 30 seconds in some cases and results in more accurate arrival and departure times throughout the lines. This was especially important for the Green Line, where many stations are close together and trains arrive frequently, so a delay of even a few seconds, repeated over the course of the day, could result in many missing events.
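The general idea of a fusion step can be sketched in a few lines. This is not a description of how Concentrate works internally; it is a simplified illustration that assumes each update carries a vehicle ID that is unique across modes and a timestamp, and that the newest update per vehicle wins. The `VehicleUpdate` type, the `fuse_feeds` function, and the sample IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleUpdate:
    source: str      # which upstream feed produced the update (hypothetical label)
    vehicle_id: str  # assumed to be unique across all modes
    timestamp: int   # seconds since some common epoch
    stop: str


def fuse_feeds(*feeds):
    """Merge several feeds into one list, keeping the newest update per vehicle."""
    latest = {}
    for feed in feeds:
        for update in feed:
            current = latest.get(update.vehicle_id)
            if current is None or update.timestamp > current.timestamp:
                latest[update.vehicle_id] = update
    # Sort by vehicle ID so the output order is stable and predictable.
    return sorted(latest.values(), key=lambda u: u.vehicle_id)


subway_feed = [
    VehicleUpdate("subway", "R-1", 100, "Park Street"),
    VehicleUpdate("subway", "R-1", 130, "Downtown Crossing"),
]
bus_feed = [VehicleUpdate("bus", "B-7", 120, "Harvard")]

consolidated = fuse_feeds(subway_feed, bus_feed)
for update in consolidated:
    print(update.vehicle_id, update.timestamp, update.stop)
```

Downstream consumers, such as a predictions display or a performance tracker, then read one consolidated list instead of handling each mode's feed separately.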

Missing events create more problems for the Green Line because we are not currently able to identify them and remove the resulting false long wait times (as we do for the Red, Orange, and Blue lines), due to complexities with the Green Line schedule and other data limitations (described more here). Improving the accuracy of stop events (arrivals and departures) for the Green Line is therefore very important for improving the accuracy of our passenger wait time metric. With Concentrate, passenger-weighted reliability metrics for the Green Line will more accurately reflect the customer experience. This will result in an increase of 2-7% in the reliability metrics for the Green Line, depending on the branch and the day.
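A small worked example shows why a single missing event matters. The sketch below assumes passengers arrive at a stop at a uniform rate and uses a made-up wait threshold of 360 seconds; the function names and arrival times are illustrative and are not taken from our performance tracking system.

```python
def headways(arrivals):
    """Gaps in seconds between consecutive recorded arrivals at one stop."""
    ordered = sorted(arrivals)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def share_waiting_longer_than(arrivals, threshold):
    """Share of passengers who wait more than `threshold` seconds,
    assuming passengers arrive at a uniform rate (an assumption)."""
    gaps = headways(arrivals)
    total = sum(gaps)
    if total == 0:
        raise ValueError("need at least two distinct arrival times")
    # In a gap of length h, passengers who show up during the first
    # h - threshold seconds wait longer than the threshold.
    long_waits = sum(max(0, h - threshold) for h in gaps)
    return long_waits / total


actual = [0, 300, 600, 900, 1200]  # a train every 5 minutes
recorded = [0, 300, 900, 1200]     # the arrival at 600 s was never recorded

share_actual = share_waiting_longer_than(actual, 360)
share_recorded = share_waiting_longer_than(recorded, 360)
print(share_actual, share_recorded)
```

With every arrival recorded, no passenger waits longer than the threshold. With one arrival missing, two 5-minute gaps look like a single 10-minute gap, and 20% of passengers in this toy example appear to have had a long wait that never happened.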

We will have to take these methodological changes into consideration when we look at heavy rail and light rail performance trends over time, so that we can distinguish changes in the wait-time measure that are due to data improvements from those that are due to service improvements.