Behind the scenes with MBTA data.


The MBTA is constantly working to improve its data quality, especially the data generated by our train tracking systems. Better data means more accurate customer information and performance measurements that more closely reflect passenger experiences. It also means that our performance measures will sometimes show discontinuities caused by improvements to the underlying data rather than by changes in performance. This post explains a change we just made to Green Line data that affects our performance measures.

Our data comes from many underlying systems, and as we upgrade those systems we are able to release new data feeds. In previous posts, we’ve explained how the MBTA tracks vehicles both in general and on the Green Line. The existing vehicle-tracking software is a legacy codebase that was built in-house. It works for the basic task of providing subway predictions, but design decisions made during its initial development have made it difficult or impossible to add new features or improve data quality.

We are replacing the software in order to add new features and improve the accuracy of our predictions. We went live with the Green Line portion of the update on February 8 and are continuing to update the software for heavy rail (the Red, Orange, and Blue lines).

Some of the new and improved features include:

* The flexibility to handle different types of shuttle-bus diversions, including ones that are created on the fly in response to incidents

* Improvements to the accuracy of our predictions for trains that are at terminal stations

* General improvements to the accuracy of predictions throughout the lines

We began this project last year, when we wrote a new application to output real-time data for the Mattapan Trolley. That allowed us to add countdown signs and to provide Mattapan Trolley location and prediction data to app developers. Our next step was to expand this application to cover the Green Line as well.

As we prepared for the launch of this new source of Green Line data, we discovered a bug in the previous version of the software. This bug did not affect the location and prediction information going out to customers about Green Line trains, but it did mean that our performance system frequently failed to record departures and arrivals of trains at terminal stations. As a result, we found many erroneous “long wait times” for passengers traveling to or from terminal stations. The new software significantly improves the accuracy of the terminal-station arrivals and departures recorded in the performance tracking system.
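To illustrate why an unrecorded departure shows up as a long wait, here is a minimal sketch. The departure times are made up for illustration, and the `headways` helper is hypothetical rather than part of our performance system.

```python
def headways(departure_minutes):
    """Return the gaps, in minutes, between consecutive recorded departures."""
    ordered = sorted(departure_minutes)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical departures from a terminal, in minutes after the hour.
actual = [0, 6, 12, 18, 24]

# The same service, but the departure at minute 12 was never recorded.
recorded = [0, 6, 18, 24]

actual_gaps = headways(actual)      # every gap is 6 minutes
recorded_gaps = headways(recorded)  # one gap appears to be 12 minutes
```

In this example the trains ran every 6 minutes, but the recorded data contains a 12-minute gap that no passenger actually experienced.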

As described in the blog post ‘February Green Line Reliability Data’, we cannot identify missing events and remove false long wait times on the Green Line as we do for the Red, Orange, and Blue lines, because of complexities in the Green Line schedule and other data limitations. Improving the accuracy of stop events (arrivals and departures) for the Green Line is therefore very important to the accuracy of our passenger wait time metric. We are also continuing to improve our performance tracking system so that it can better distinguish real long wait times from those caused by missing data, whether from bad GPS reads or from data missed during the processing and polling cycles between systems.

We deployed this new software on February 8, 2018, so for dates starting February 8, 2018, the reported percentage of Green Line customers experiencing wait times longer than a scheduled headway will more accurately reflect the customer experience. This will result in an increase of 1-6% in the Green Line reliability metrics, depending on the branch and the day.
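As a rough sketch of how false long gaps can pull this kind of metric down, the example below estimates the share of passengers who wait longer than the scheduled headway. It assumes passengers arrive at a steady rate, so that in a gap of `h` minutes only those arriving in the first `h - scheduled` minutes wait too long. That assumption, the function name, and the numbers are illustrative only and are not a description of our actual calculation.

```python
def share_waiting_longer(headways_min, scheduled_headway_min):
    """Estimate the share of passengers waiting longer than the scheduled headway.

    Assumes passengers arrive at a uniform rate (an illustrative assumption),
    so passenger counts are proportional to the length of each gap.
    """
    total = sum(headways_min)
    if total == 0:
        return 0.0
    # Minutes of each gap during which an arriving passenger waits too long.
    excess = sum(max(0, h - scheduled_headway_min) for h in headways_min)
    return excess / total


# Hypothetical gaps with every departure recorded, 6-minute scheduled headway.
share_complete = share_waiting_longer([6, 6, 6, 6], 6)

# The same service with one departure missing from the data.
share_missing = share_waiting_longer([6, 12, 6], 6)
```

Under these assumptions, the complete data shows no passengers waiting longer than scheduled, while the data with one missing departure shows a quarter of passengers doing so, even though service was identical.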

We will have to take this methodological change into account when we look at Green Line performance trends over time, so that we can tell which increases in the wait-time measure are due to data improvements and which are due to service improvements.
