We present a fast algorithm for automatic segmentation of white matter fibers from tractography datasets based on a multi-subject bundle atlas. We describe a sequential version of the algorithm that runs on a desktop computer CPU, as well as a highly parallel version that uses a Graphics Processing Unit (GPU) as an accelerator. Our sequential implementation runs 270 times faster than a C++/Python implementation of a previous algorithm based on the same segmentation method, and 21 times faster than a highly optimized C version of that previous algorithm. Our parallelized implementation exploits the multiple computation units and memory hierarchy of the GPU to further speed up the algorithm by a factor of 30 with respect to our sequential code. As a result, the time to segment a subject dataset of 800,000 fibers is reduced from more than 2.5 hours in the C++/Python code to less than one second in the GPU version.