Available online at www.sciencedirect.com

ScienceDirect

Procedia CIRP 126 (2024) 429–434

17th CIRP Conference on Intelligent Computation in Manufacturing Engineering (CIRP ICME '23)

A meta-learning strategy based on deep ensemble learning for tool condition monitoring of machining processes

Jose Joaquin Peralta Abadia*, Mikel Cuesta Zabaljauregui, Felix Larrinaga Barrenechea

Mondragon Unibertsitatea, Loramendi Kalea, 4, 20500 Arrasate, Spain

* Corresponding author. E-mail address: jjperalta@mondragon.edu

Abstract

For Industry 4.0, tool condition monitoring (TCM) of machining processes aims to increase process efficiency and quality and to lower tool maintenance costs. To this end, TCM systems monitor variables of interest, such as tool wear. In this paper, a novel meta-learning strategy based on ensemble learning and deep learning (DL) is proposed for tool wear monitoring and is compared with state-of-the-art DL models selected from recent literature, using open-access datasets as input, validating its implementation in an industrial scenario. As a result of this study, a novel meta-learning strategy for tool wear monitoring with minimal error is proposed and validated.

© 2024 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0). Peer-review under responsibility of the scientific committee of the 17th CIRP Conference on Intelligent Computation in Manufacturing Engineering (CIRP ICME '23).

Keywords: tool wear; deep learning; Industry 4.0; tool condition monitoring; ensemble learning

1. Introduction

Machining processes, such as milling, are widely used in manufacturing to achieve highly accurate machine parts and good surface integrity [1]. To satisfy the quality requirements of the finished piece, tool condition monitoring (TCM) systems are required to improve product quality, process dependability, and production efficiency [2]. The primary aim of TCM is to identify the appropriate time to replace cutting tools. Changing tools too soon disrupts production times; changing them too late can cause damage to equipment, machines, and workpieces.

However, TCM of machining processes, and in particular deep learning (DL)-based TCM, has yet to fully reach the shop floor [2]. This is because DL models usually require big data for training, which is challenging in machining processes, where data is generally not publicly available or is unlabelled [3]. Aiming to mitigate the problem of data availability, open-access datasets have been published in the literature, such as the NASA Ames/UC Berkeley milling dataset [4]. As a result, several authors have proposed DL models trained with this dataset.

Aghazadeh et al. (2018) implemented a convolutional neural network (CNN) model in combination with spectral subtraction of wavelet packets, using the current signals of the dataset and achieving a root-mean-squared error (RMSE) of 0.088 mm [5]. More recently, Cai et al. (2020) presented a hybrid model based on long short-term memory (LSTM) networks. The model was trained with all signals and cutting conditions of the dataset, using 4 cases for testing and the remaining 12 for training, and achieved an RMSE of 0.0456 mm. The LSTM layer was used for temporal encoding of features; thereafter, a non-linear regression network combined the temporal features obtained from the LSTM with the cutting conditions to perform the predictions [6]. Another hybrid LSTM model, comprised of bidirectional LSTM and encoder-decoder LSTM layers, was proposed by Kumar et al. (2022). The model used time and frequency features extracted from the vibration signals of the dataset, achieving an RMSE of 0.0364 mm [7]. Finally, Pillai and Vadakkepat (2022) presented a temporal multivariate 3D convolutional network model, trained with 3D features obtained from kernel-based transformations of the signals, achieving an RMSE of 0.0424 mm [8].
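The reference models above are compared by RMSE, and later sections also report R² and MAE. As a reminder, the three metrics can be computed as in the following minimal sketch (function names are illustrative; inputs are ground-truth and predicted VB values in mm):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error, in the same units as VB (mm)."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error (mm)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 is a perfect fit."""
    y_true = np.asarray(y_true, dtype=float)
    ss_res = np.sum((y_true - np.asarray(y_pred)) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```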
10.1016/j.procir.2024.08.391

Although good performance has been obtained from the models in scientific experiments, the margin of error is still not acceptable for industrial implementations. Furthermore, the selection of signals from the dataset to be used as inputs requires a systematic approach. Therefore, the potential of applying DL to TCM requires further research, such as meta-learning strategies that combine DL models with ensemble learning techniques [9].

In this paper, a novel meta-learning strategy based on deep ensemble learning (DEL) is proposed for tool wear monitoring. The strategy was compared with state-of-the-art DL models selected from recent literature, using the NASA Ames/UC Berkeley open-access dataset as input. As such, the contribution of this paper is twofold:

• Meta-learning strategy: a novel meta-learning strategy based on deep ensemble learning (DEL), compared against state-of-the-art DL models selected from recent literature and showing superior prediction performance.
• An analysis of the signals of the NASA Ames/UC Berkeley dataset to identify the ideal signals to be used as inputs for DL models. The signals are analysed, cleaned, and augmented. Then, four combinations of signals (all signals; current and acoustic emission signals; current signal; vibration signal) are compared in relation to their effect on DL model performance.

The remainder of this paper is structured as follows. Section 2 describes the open-access dataset used in this study. Section 3 describes the methodology followed to implement the meta-learning strategy. Thereafter, Section 4 presents results and discussion. Finally, Section 5 presents conclusions and an outlook on future work.

2. Dataset description

The NASA Ames/UC Berkeley open-access dataset [4] was used in this study as input for training the meta-learning strategy based on DEL. The dataset encompasses 16 face milling experiments performed on a milling machine under varying cutting conditions. Three types of sensors, i.e., acoustic emission (AE) sensors, vibration sensors, and current sensors, were employed to collect data with a sampling rate of 250 Hz. Specifically, the sensors collected the signals spindle motor current AC (smcAC), spindle motor current DC (smcDC), table vibration (vib_table), spindle vibration (vib_spindle), table AE (AE_table), and spindle AE (AE_spindle). In addition, the dataset was enriched with process information, such as case number, experimental run count, tool wear (VB), experiment duration, and cutting conditions. Cutting conditions included depth of cut (DOC), feed rate, and material type.

A total of 167 runs were performed, of approximately 36 s each and containing 9000 measurement points per run. The number of runs per case varied according to the extent of VB, which was assessed between runs at variable intervals. Specifically, VB was not recorded for all runs. Moreover, the degree of tool wear surpassed the manufacturer-recommended VB limit in some cases.

The experimental conditions of the cases are presented in Table 1 and include two values for DOC (1.5 and 0.75 mm), two values for feed rate (0.5 and 0.25 mm/rev), and two material types (1: cast iron; 2: stainless steel). The cutting tools used were KC710 inserts, the cutting speed was 200 m/min (or 826 rev/min), and the workpiece size was 483 mm x 178 mm x 51 mm. Eight combinations of cutting conditions were defined, and each combination was repeated a second time with a new set of cutting tools.

Table 1. Experimental conditions of the NASA Ames/UC Berkeley dataset.

Case  DOC   Feed rate  Material    Case  DOC   Feed rate  Material
1     1.5   0.5        1           9     1.5   0.5        1
2     0.75  0.5        1           10    1.5   0.25       1
3     0.75  0.25       1           11    0.75  0.25       1
4     1.5   0.25       1           12    0.75  0.5        1
5     1.5   0.5        2           13    0.75  0.25       2
6     1.5   0.25       2           14    0.75  0.5        2
7     0.75  0.25       2           15    1.5   0.25       2
8     0.75  0.5        2           16    1.5   0.5        2

The Pearson correlation coefficient was applied to the dataset, obtaining the correlations between the dataset features, depicted as a correlation matrix in Fig. 1. Of the signals, smcDC reported the highest correlation with VB and also presented a high correlation with the AE signals. Fig. 2 illustrates the VB histogram of the dataset, in which an exponential distribution is observed. VB progressed slowly in both the break-in and regular wear stages of the cutting tool. The VB curve increased exponentially in the high and critical wear stages, until the tool was no longer usable.

Fig. 1. Correlation matrix of the NASA Ames/UC Berkeley dataset.

Fig. 2. VB histogram of the NASA Ames/UC Berkeley dataset.
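The correlation screening described above can be sketched as follows, assuming the runs have been loaded into a pandas DataFrame with one column per signal plus the VB label (column names follow the dataset's signal names; the data loading itself is omitted):

```python
import pandas as pd

SIGNALS = ["smcAC", "smcDC", "vib_table", "vib_spindle", "AE_table", "AE_spindle"]

def vb_correlations(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each signal column with tool wear VB,
    sorted from strongest to weakest."""
    corr = df[SIGNALS + ["VB"]].corr(method="pearson")  # full matrix, as in Fig. 1
    return corr["VB"].drop("VB").sort_values(ascending=False)
```

In the paper this screening singled out smcDC as the signal most correlated with VB, motivating the reduced input combinations explored later.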
3. Methodology

The methodology to achieve TCM using the novel meta-learning strategy comprised three steps. First, data pre-processing was performed on the NASA Ames/UC Berkeley open-access dataset to clean and prepare it for training the DEL model. Second, machine learning (ML) models were developed and implemented as baseline models. Finally, the meta-learning strategy based on DEL was developed and implemented.

3.1. Data pre-processing

The dataset contains measurements collected during the entry, regular, and exit cuts of the experiments. In this study, the entry and exit cut portions of the signals were omitted, focusing only on the regular cut portion of the machining process. Furthermore, since some cases did not record VB, linear interpolation was performed to use all available data. Thereafter, the signals of each run were evaluated. Data acquired in eight runs were corrupted or had undocumented events and were omitted in this study, resulting in 159 runs for training and testing. In addition, 22 runs had signals with noisy values, which could have a negative impact on the prediction capabilities of the meta-learning strategy. For predicting tool wear, the global behaviour of the signal is more important than localized events (e.g., chipping). Therefore, a moving average with window size 20 was applied to average out the noisy values while maintaining the global behaviour of the signals. The two groups of treated runs are the following:

• Omitted:
○ Case 1 - Runs 16 and 17: VB lowers after run 15.
○ Case 2 - Run 5: missing data in AE_table.
○ Case 2 - Run 6: corrupt data in AE_spindle.
○ Case 7 - Run 4: corrupt data in AE_table.
○ Case 8 - Run 3: missing data in AE_table.
○ Case 12 - Run 1: corrupt data in all signals.
○ Case 12 - Run 12: undocumented event in all signals.
• Noise:
○ Case 3 - Run 9.
○ Case 7 - Run 8.
○ Case 8 - Run 4.
○ Case 10 - Runs 2 and 10.
○ Case 11 - Runs 10 and 21.
○ Case 12 - Runs 3 and 7.
○ Case 13 - Runs 3, 6, 8, 9, 13, and 14.
○ Case 14 - Runs 1, 2, 3, 6, and 10.
○ Case 15 - Runs 1, 2, 3, 4, 6, and 7.

3.2. Machine learning baseline models

Six ML models were trained with the input data as baseline models: (i) decision tree, (ii) random forest, (iii) support vector machine (SVM), (iv) gradient boosting, (v) XGBoost, and (vi) k-nearest neighbours (kNN). For the sake of brevity, detailed descriptions of the algorithms are omitted but can be found in [10,11].

The feature extraction methodology proposed in [12] was adopted to train the baseline ML models. Time domain, frequency domain, and time-frequency domain features were extracted and are presented in Table 2, for a total of 54 extracted features. A more detailed description of the extracted features can be found in [12]. Afterwards, the features were normalized with z-normalization using the standard z-score, calculated as z = (x - μ) / σ, where x is the value of the feature, μ is the mean of the feature, and σ is the standard deviation of the feature.

Table 2. Extracted features of the time, frequency, and time-frequency domains.

Domain          Feature
Time            RMS, variance, maximum, kurtosis, skewness, peak-to-peak
Frequency       Spectral skewness, spectral kurtosis
Time-frequency  Wavelet energy

Given the high quantity of features and the inherent high correlation among them, a dimensionality reduction approach was required. To this end, principal component analysis (PCA) was used [13]. The variance of the dataset that each component represents was analysed to determine the number of principal components to be chosen. At least 95% of the variance was considered necessary to properly represent the dataset [12].

3.3. Meta-learning strategy based on deep ensemble learning

Ensemble learning trains multiple ML or DL models, called base learners, to output several weak predictions for the same problem. The predictions are generally combined using voting and averaging mechanisms, which results in better performance than that of the models by themselves [14]. Recently, meta-learning has been proposed for combining the predictions, to improve the performance of ensemble learning. Meta-learning consists of learning from the outputs of each of the base learners and making predictions based on the combined outputs.

Fig. 3. Meta-learning model architecture.
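The stacked-generalization idea described in Section 3.3 (base learners whose predictions feed a meta-learner) can be illustrated with scikit-learn's StackingRegressor. The paper's base learners are DL models (LSTM, BiGRU, CNN) and its meta-learner a DNN; the estimators below are simplified stand-ins chosen only to show the mechanism, and the synthetic data does not represent the milling dataset:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor

# Base learners produce "weak" predictions; the meta-learner (here a small MLP,
# standing in for the paper's DNN) learns how best to combine them.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=3)),
        ("ridge", Ridge()),
    ],
    final_estimator=MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 0.5 + X[:, 1] * 0.2   # synthetic stand-in for a tool wear target
stack.fit(X, y)
preds = stack.predict(X)
```

Internally, StackingRegressor trains the meta-learner on cross-validated predictions of the base learners, which mirrors the paper's description of using the weak predictions as input features of the meta-learner.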
Hence, well-performing base learners help offset those that perform badly on some problems, and vice versa on other problems. The most commonly used meta-learning strategy is stacked generalization (or stacking), which learns how to best combine the outputs of the base learners by using another ML or DL model [15].

A heterogeneous DEL approach was implemented, comprised of LSTM, bidirectional gated recurrent unit (BiGRU), and CNN models as base learners. Moreover, a deep neural network (DNN) was used as meta-learner, combining the predictions of the base learners. First, the base learners were trained with the signals as input data. A DL stacking meta-learner was subsequently defined and trained, with the trained base learners used as initial layers. As a result, the weak predictions were the input features of the meta-learner. Fig. 3 depicts the architecture of the meta-learning model.

The model was evaluated using all available signals, owing to the benefits of sensor fusion [2]. Moreover, other combinations were explored as well. Fig. 1 shows that smcDC had the highest correlation with VB, followed by both AE signals. In general, AE signals have high accuracy and resolution and have proven reliable for detecting events in machining processes [1]. Therefore, a combination of smcDC with the AE_table and AE_spindle signals was explored, to evaluate the performance of the approach with fewer signals but a relatively high correlation among them. The performance of the approach was compared with state-of-the-art DL models selected from recent literature [5–8]. Since some of the DL models were trained only with either the vibration or the current signals, the use of smcDC as single input, as well as of vib_spindle, was also explored for training the meta-learning strategy. Consequently, four strategies for training meta-learning models with varied input data were explored: (i) all sensor signals, (ii) the AE_table, AE_spindle, and smcDC sensor signals, (iii) the smcDC sensor signal, and (iv) the vib_spindle sensor signal.

4. Results and discussion

Six ML baseline models and a meta-learning model based on DEL were trained and tested. For the baseline models, time, frequency, and time-frequency domain features were extracted and z-normalized, for a total of 54 features (nine features per signal). PCA was selected for dimensionality reduction, and the explained variance of the components is presented in Fig. 4. The first 25 components were selected, as they represent 95% of the variance in the dataset.

Fig. 4. Explained variance of PCA of the NASA Ames/UC Berkeley dataset.

Table 3 presents the hyperparameters of the baseline models, as well as their performance metrics during testing. The hyperparameters were obtained using a randomized search cross-validation method. The coefficient of determination (R²), mean absolute error (MAE), and root-mean-squared error (RMSE) were chosen as performance metrics. The performance of a model is best when R² is closest to one and MAE and RMSE are closest to zero. The best performing model was kNN, followed closely by XGBoost. However, the scores indicate that the models could have an average error of 0.0739 mm in their predictions. In industrial scenarios, a maximum tool wear of 0.3 mm is recommended by manufacturers. Thus, the error in the predictions represents 25% of the industrial tool life and would not be acceptable on shop floors.

Table 3. Hyperparameters and performance metrics of baseline models.

Model              Hyperparameters                                   R²      RMSE    MAE
Decision tree      Default parameters                                0.7225  0.1371  0.0534
SVM                C = 9.8143, ε = 0.0012, kernel = RBF              0.8639  0.0961  0.0595
Random forest      Max. depth = 20, no. estimators = 437             0.8471  0.1018  0.0610
Gradient boosting  Learning rate = 0.0975, max. depth = 13,          0.8570  0.0984  0.0623
                   no. estimators = 169
XGBoost            Learning rate = 0.0098, max. depth = 12,          0.8952  0.0843  0.0478
                   no. estimators = 577, min. child weight = 4
kNN                No. neighbours = 2, weights = distance            0.9195  0.0739  0.0224
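The baseline pipeline described above (z-normalization, PCA retaining 95% of the variance, and randomized hyperparameter search) can be sketched with scikit-learn as follows. The synthetic data and the search ranges are illustrative, not the values used in the paper:

```python
import numpy as np
from scipy.stats import randint
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import RandomizedSearchCV

pipe = Pipeline([
    ("zscore", StandardScaler()),        # z = (x - mu) / sigma, as in Section 3.2
    ("pca", PCA(n_components=0.95)),     # keep components explaining 95% of variance
    ("knn", KNeighborsRegressor()),
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "knn__n_neighbors": randint(1, 10),
        "knn__weights": ["uniform", "distance"],
    },
    n_iter=10,
    cv=3,
    random_state=0,
)

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 54))           # 54 extracted features, as in the paper
y = X[:, :5].sum(axis=1)                 # synthetic stand-in for the VB target
search.fit(X, y)
best = search.best_params_
```

The same pattern applies to the other baseline models by swapping the final pipeline step and its parameter distributions.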
After training and testing the baseline models, the meta-learning model based on DEL was implemented. As with the baseline models, the data was z-normalized. Furthermore, since DL models require big data, a sliding window approach was adopted to augment the dataset. The sliding window had size 250 (one second) and stride 25 (1/10 of a second), increasing the dataset size from 166 datapoints with a sequence length of 5400 to 31323 datapoints with a sequence length of 250.

All strategies shared the same model hyperparameters. The LSTM, BiGRU, and DEL models were trained for 1000 epochs, with an early stop after 50 epochs without model improvement. The CNN model required more epochs to generalize knowledge, so 4000 epochs with an early stop after 200 epochs were defined. All models used the ADAM optimizer with a learning rate of 0.0001 and RMSE as loss function. To avoid overfitting, a dropout of 10% and an L2 regularization factor of 0.00001 were implemented. The dataset was split stochastically into 48% for training, 12% for validation, and 40% for testing. The split was made stochastically to account for the variability in cutting conditions and tools that may occur on industrial shop floors.

The performance results of the meta-learning model, grouped by input data strategy, as well as a comparison with state-of-the-art DL models, are presented in Table 4. The results in the table prove the benefits of the meta-learning strategy, which improves the quality of the predictions by combining the predictions of the base learners. Among the base learners, the LSTM and BiGRU models performed better than the CNN model with combinations of signals. However, when using individual signals, the LSTM model was the worst predictor.

The meta-learning model outperformed the results of the two reference models that used all signals in the dataset, with an RMSE of 0.0145 mm and an R² score of 0.9967. Moreover, the LSTM and BiGRU base learners also outperformed the reference models, with RMSEs of 0.0207 and 0.0149 mm, respectively. Thus, the efficiency of the data cleaning and augmentation process before training DL models was proven. The performance results of the meta-learning model when trained with the smcDC and AE signals showed a bigger margin of error, with an RMSE of 0.0473 mm and an R² score of 0.9660. Nevertheless, the model required fewer inputs, and the results are comparable to those of the reference models that use all signals. Finally, the results when using individual signals were underperforming. To achieve good results, the architecture of the models was expanded, adding two extra layers to the base learners. With smcDC, the model had an RMSE of 0.1699 mm and an R² score of 0.5715, and, with vib_spindle, the model had an RMSE of 0.1130 mm and an R² score of 0.8072.

Table 4. Performance results of the meta-learning model. Best performing models are highlighted in bold.

               All signals            DC and AE signals      DC signal              Vibration signal
Model          R²     RMSE    MAE     R²     RMSE    MAE     R²     RMSE    MAE     R²     RMSE    MAE
LSTM           0.9935 0.0207  0.0060  0.9454 0.0601  0.0225  0.3606 0.2078  0.1435  0.4712 0.1910  0.1348
CNN            0.9611 0.0507  0.0308  0.8597 0.0963  0.0632  0.5114 0.1815  0.1119  0.5778 0.1707  0.1169
BiGRU          0.9966 0.0149  0.0042  0.9630 0.0494  0.0229  0.3650 0.2067  0.1433  0.7997 0.1176  0.0748
Meta-learning  0.9967 0.0145  0.0055  0.9660 0.0473  0.0220  0.5715 0.1699  0.1048  0.8072 0.1130  0.0714

CNN with spectral subtraction [5]: RMSE = 0.088
LSTM with process information [6]: RMSE = 0.0456, MAE = 0.0322
Hybrid LSTM [7]: R² = 0.9837, RMSE = 0.0364, MAE = 0.0258
TM3C-KT [8]: RMSE = 0.0424
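The sliding-window augmentation described above (window of 250 samples, i.e., one second at 250 Hz, with a stride of 25 samples) can be sketched as follows; for a trimmed run of 5400 samples this yields 207 overlapping windows:

```python
import numpy as np

def sliding_windows(signal: np.ndarray, size: int = 250, stride: int = 25) -> np.ndarray:
    """Split a 1-D signal into overlapping windows, one window per row."""
    n = (len(signal) - size) // stride + 1
    return np.stack([signal[i * stride : i * stride + size] for i in range(n)])

run = np.arange(5400.0)          # one trimmed run of 5400 samples (values illustrative)
windows = sliding_windows(run)   # shape (207, 250)
```

Applied across all 166 runs with varying lengths, this kind of augmentation produces the enlarged training set reported in the paper.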
Consequently, further research efforts should be devoted to improving the performance of the meta-learning model when using individual signals.

Fig. 5 presents a comparison of the VB curves of the ground truth values and of the values predicted by the meta-learning model when using all inputs. The data was ordered by ground truth value, as all cases and runs were augmented and shuffled stochastically during splitting. It may be observed that the model predicted values very close to the ground truth throughout the wear curve, proving the effectiveness and good performance of the approach when using sensor fusion.

Fig. 5. Tool wear curve of the test data set, ordered by the ground truth VB value.

5. Summary and conclusions

In this paper, a tool wear monitoring approach based on meta-learning using deep ensemble learning has been presented. The meta-learning approach is proposed to improve performance when predicting tool wear in machining. The approach uses deep ensemble learning to combine the outputs of multiple deep neural network models, i.e., LSTM, CNN, and BiGRU models, resulting in improved accuracy and robustness. The meta-learning approach has been validated using the NASA Ames/UC Berkeley open-access milling dataset, which was augmented and denoised. A combination of all the signals, the smcDC and AE signals, and individual signals (smcDC and vib_spindle) were used for the validation tests. The best results were obtained when using all the signals, substantially outperforming state-of-the-art DL-based reference models and proving the benefits of sensor fusion. Future work will involve investigating the ability of the meta-learning approach to detect tool wear in other machining datasets. Furthermore, data pre-processing and feature extraction techniques, as well as DL model hyperparameter tuning and architectural changes, will be studied to improve the performance of the approach when using individual signals as inputs.

Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 814078 and from the Department of Education, Universities and Research of the Basque Government under the Ikerketa Taldeak projects (Grupo de Ingeniería de Software y Sistemas IT1519-22 and Grupo de investigación de Mecanizado de Alto Rendimiento IT1443-22).

References

[1] C.H. Lauro, L.C. Brandão, D. Baldo, R.A. Reis, J.P. Davim, Monitoring and processing signal applied in machining processes - A review, Measurement 58 (2014) 73–86.
[2] R. Teti, D. Mourtzis, D.M. D'Addona, A. Caggiano, Process monitoring of machining, CIRP Annals 71 (2022) 529–552.
[3] D.Y. Pimenov, A. Bustillo, S. Wojciechowski, V.S. Sharma, M.K. Gupta, M. Kuntoğlu, Artificial intelligence systems for tool condition monitoring in machining: analysis and critical review, J Intell Manuf (2022) 1–43.
[4] A. Agogino, K. Goebel, Milling data set, 2007.
[5] F. Aghazadeh, A. Tahan, M. Thomas, Tool condition monitoring using spectral subtraction and convolutional neural networks in milling process, International Journal of Advanced Manufacturing Technology 98 (2018) 3217–3227.
[6] W. Cai, W. Zhang, X. Hu, Y. Liu, A hybrid information model based on long short-term memory network for tool condition monitoring, J Intell Manuf 31 (2020) 1497–1510.
[7] S. Kumar, T. Kolekar, K. Kotecha, S. Patil, A. Bongale, Performance evaluation for tool wear prediction based on Bi-directional, Encoder–Decoder and Hybrid Long Short-Term Memory models, International Journal of Quality and Reliability Management 39 (2022) 1551–1576.
[8] S. Pillai, P. Vadakkepat, Deep learning for machine health prognostics using Kernel-based feature transformation, J Intell Manuf 33 (2022) 1665–1680.
[9] P. Cawood, T. Van Zyl, Evaluating State-of-the-Art, Forecasting Ensembles and Meta-Learning Strategies for Model Fusion, Forecasting 4 (2022) 732–751.
[10] S. Russell, P. Norvig, et al., Artificial Intelligence: A Modern Approach, Pearson Education Limited, London, 2013.
[11] C. Bentéjac, A. Csörgő, G. Martínez-Muñoz, A comparative analysis of gradient boosting algorithms, Springer Netherlands, 2021.
[12] J. Wang, J. Xie, R. Zhao, L. Zhang, L. Duan, Multisensory fusion based virtual tool wear sensing for ubiquitous manufacturing, Robot Comput Integr Manuf 45 (2017) 47–58.
[13] R. Vidal, Y. Ma, S.S. Sastry, Generalized Principal Component Analysis, Springer-Verlag, New York, 2016.
[14] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, CRC Press, 2012.
[15] L. Rokach, Pattern Classification Using Ensemble Methods, World Scientific, 2010.