Eleven widely used crop simulation models (APSIM, CERES, CROPSYST, COUP, DAISY, EPIC, FASSET, HERMES, MONICA, STICS and WOFOST) were tested using a spring barley (Hordeum vulgare L.) data set with varying nitrogen (N) fertilizer rates from three experimental years in the boreal climate of Jokioinen, Finland. This is the largest standardized crop model inter-comparison under different levels of N supply to date. The models were calibrated using data from 2002 and 2008, of which 2008 included six N rates ranging from 0 to 150 kg N/ha. Calibration data consisted of weather, soil, phenology, leaf area index (LAI) and yield observations. The models were then tested against new data for 2009, and their performance was assessed and compared for both the two calibration years and the test year. For the calibration period, the root mean square error between measured and simulated grain dry matter yields ranged from 170 to 870 kg/ha. In the test year 2009, most models failed to accurately reproduce both the observed low yield without N fertilizer and the steep yield response to N applications. The multi-model predictions were closer to observations than most single-model predictions, but the multi-model mean could not correct systematic errors in model simulations. Variation in soil N mineralization and LAI development due to differences in weather, not captured by the models, was most likely the main reason for their unsatisfactory performance. This suggests the need for model improvement in soil N mineralization as a function of soil temperature and moisture. Furthermore, the impacts of specific weather events, such as low temperatures after emergence in 2009, which tended to enhance tillering, and a high precipitation event just before harvest in 2008, which possibly caused yield penalties, were not captured by any of the models compared in the current study.
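The two evaluation quantities named above, root mean square error and the multi-model mean, can be illustrated with a minimal Python sketch. This is not the study's evaluation code: the function names, the model labels `model_a` and `model_b`, and all yield values are hypothetical and chosen only to show how the quantities are computed.

```python
import math


def rmse(observed, simulated):
    """Root mean square error between paired observed and simulated yields (kg/ha)."""
    if not observed or len(observed) != len(simulated):
        raise ValueError("observed and simulated must be non-empty and of equal length")
    squared_errors = [(s - o) ** 2 for o, s in zip(observed, simulated)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def multi_model_mean(simulations):
    """Average the simulated yields of all models, treatment by treatment."""
    runs = list(simulations.values())
    n_treatments = len(runs[0])
    if any(len(run) != n_treatments for run in runs):
        raise ValueError("all models must simulate the same treatments")
    return [sum(run[i] for run in runs) / len(runs) for i in range(n_treatments)]


# Hypothetical yields (kg/ha) for three N treatments; NOT data from the study.
observed = [2000.0, 4000.0, 5000.0]
simulations = {
    "model_a": [2600.0, 4200.0, 5100.0],  # hypothetical model, overestimates
    "model_b": [3000.0, 4400.0, 5300.0],  # hypothetical model, overestimates more
}

ensemble = multi_model_mean(simulations)
scores = {name: rmse(observed, run) for name, run in simulations.items()}
scores["multi_model_mean"] = rmse(observed, ensemble)

for name, score in scores.items():
    print(f"{name}: RMSE = {score:.0f} kg/ha")
```

In this made-up case both models overestimate every treatment, so the averaged prediction overestimates as well. That is the sense in which a multi-model mean cannot remove an error shared by its members.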