The widespread adoption of positioning devices has generated a large volume of vehicle driving data, making it possible to predict vehicle travel time from historical data. Vehicle driving data consists of two parts: the sequence of road segments that the vehicle travels through, and external information such as the departure time and the total length of the path. How to extract sequence features from the road segments, and how to effectively fuse those sequence features with the external features, are the key issues in travel time prediction. To address these issues, a transformer-based travel time prediction model is proposed, which consists of two parts: a road segment sequence processing module and a feature fusion module. First, the road segment sequence processing module uses the self-attention mechanism to process the road segment sequence and extract its sequence features. This design not only fully considers the spatiotemporal correlation of road speeds between each road segment and all other road segments, but also allows the data to be input into the model in parallel, avoiding the inefficiency caused by sequential input when using recurrent neural networks. Second, the feature fusion module fuses the road segment sequence features with the external information, such as departure time, to obtain the predicted travel time. On this basis, the number of road segments connected at the upstream and downstream intersections of each road segment is used as an intersection feature and input into the model together with the road segment features, further improving the accuracy of the travel time prediction. Comparative experiments against mainstream prediction methods on real data sets show that the model improves both prediction accuracy and training speed, demonstrating the effectiveness of the proposed method.
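The two-module structure described above can be illustrated with a minimal sketch in plain Python. It is not the proposed model itself: it assumes a single attention head, untrained random weights, mean pooling over segments, and a linear fusion layer, and it omits positional encoding. All names (`self_attention`, `predict_travel_time`, `params`) and the feature layouts are hypothetical choices made only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    Every segment attends to every other segment in one pass, so the
    sequence is processed in parallel rather than step by step.
    """
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scores = matmul(q, k_t)                       # (n_segments, n_segments)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def predict_travel_time(segments, external, params):
    """Attend over segment features, mean-pool, then fuse with external features."""
    attended, _ = self_attention(segments, params["wq"], params["wk"], params["wv"])
    n = len(attended)
    pooled = [sum(col) / n for col in zip(*attended)]   # sequence feature vector
    fused = pooled + external                           # concatenation (assumed fusion)
    return sum(w * f for w, f in zip(params["w_out"], fused)) + params["b_out"]


def random_matrix(rows, cols, rng):
    """Small random weight matrix; stands in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_in, d_model, d_ext = 3, 4, 2
params = {
    "wq": random_matrix(d_in, d_model, rng),
    "wk": random_matrix(d_in, d_model, rng),
    "wv": random_matrix(d_in, d_model, rng),
    "w_out": [rng.uniform(-0.5, 0.5) for _ in range(d_model + d_ext)],
    "b_out": 0.0,
}

# Hypothetical per-segment features:
# [length_km, segments at upstream intersection, segments at downstream intersection]
segments = [[0.4, 3, 4], [1.2, 4, 3], [0.7, 3, 3]]
# Hypothetical external features: [departure hour / 24, total path length in km]
external = [8 / 24, 2.3]

prediction = predict_travel_time(segments, external, params)
print(prediction)  # arbitrary value, since the weights are untrained
```

Because the weights are random, the printed number has no meaning as a travel time; the sketch only shows how attention over the segment sequence and concatenation with external features could fit together. A trained model would typically add positional encoding, multiple heads, and a nonlinear fusion network.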