Title:
Information Extraction from Messy Data, Noisy Spectra, Incomplete Data, and Unlabeled Images

Author(s)
Tian, Hongzhen
Advisor(s)
Zhang, Chuck
Mei, Yajun
Abstract
Data collected in real-world settings are never ideal and often messy, because data errors are inevitable and can occur in creative and unexpected ways. There are always unexpected difficulties in moving from idealized theory to real-world applications. Although data science has produced more and more elegant algorithms that are validated by rigorous proof, data scientists still spend 50% to 80% of their work time cleaning and organizing data, which leaves little time for actual analysis. This dissertation addresses three statistical modeling scenarios with common data issues: quantifying treatment effects on noisy functional data, multistage decision-making over incomplete data, and unsupervised image segmentation of imperfect engineering images. A methodology is proposed for each scenario.
In Chapter 2, a general two-step procedure is proposed to quantify the effects of a treatment on spectral signals that are subject to multiple uncertainties, for an engineering application involving materials treatment in aircraft maintenance. The procedure addresses two types of uncertainty in the spectral signals: offset shift and multiplicative error. In the first step, a novel optimization problem is formulated to estimate a representative template spectrum. In the second step, another optimization problem is formulated to obtain the pattern of modification $\mathbf{g}$, which reveals how the treatment affects the shape of the spectral signal, together with a vector $\boldsymbol{\delta}$ that describes the degree of change caused by different treatment magnitudes. The effectiveness of the proposed method is validated in a simulation study. Furthermore, in a real case study, the method is used to investigate the effect of plasma exposure on FTIR spectra. It identifies the pattern of modification under the uncertainties of the manufacturing environment, and the pattern matches existing knowledge of the chemical components affected by the plasma treatment. The recovered magnitude of modification provides guidance for selecting the control parameter of the plasma treatment.
In Chapter 3, an active learning-based multistage sequential decision-making model is proposed to help doctors and patients make cost-effective treatment recommendations when some clinical data are more expensive or time-consuming to collect than other laboratory data. The main idea is to formulate the incomplete clinical data as a multistage decision-making model in which doctors make diagnostic decisions sequentially across stages and actively collect only the necessary examination data, from certain patients rather than from all of them. The parameter estimation has two novel aspects. First, unlike the existing ordinal logistic regression model, which models only a single stage, a multistage model is built by maximizing the joint likelihood function over all samples in all stages. Second, because the data in different stages are nested cumulatively, the coefficients of features shared across stages are assumed to be invariant. Compared with the baseline approach, which models each stage individually and independently, the proposed multistage model with the common-coefficients assumption significantly reduces the number of parameters to estimate and improves computational efficiency. It is also intuitive for doctors, because it assumes that newly added features do not change the weights of existing ones. In a simulation study, the relative efficiency of the proposed method with respect to the baseline approach ranges from 162% to 1,938%. In a real case study, the proposed method estimates all parameters efficiently and yields reasonable estimates.
In Chapter 4, a simple yet effective unsupervised image segmentation method, called the RG-filter, is proposed to segment engineering images that have no significant contrast between foreground and background, for a material testing application. Given the challenges of limited data size, imperfect data quality, and unavailable binary ground-truth labels, we developed the RG-filter, which thresholds pixels according to the relative magnitude of the R channel and the G channel of the RGB image. To compare existing image segmentation methods with the proposed algorithm on our CFRP image data, we conducted a series of experiments on an example specimen. Across all pixel labeling results, the proposed RG-filter outperforms the other methods and is the one we recommend. In addition, it is intuitive and computationally efficient. The RG-filter can help analyze the distribution and proportion of failure modes on the surface of a composite material after destructive DCB testing. The result can help engineers better understand the weak link in the bonding of composite materials, which may provide guidance on how to improve the joining of structures during aircraft maintenance. The result can also serve as important input when it is modeled jointly with downstream data. If it can be predicted from other variables, destructive DCB testing can be avoided, saving considerable time and money.
In Chapter 5, we conclude the dissertation, summarize its original contributions, and discuss future research topics associated with it. In summary, the dissertation contributes to the area of System Informatics and Control (SIAC) by developing systematic methodologies for messy real-world data in the fields of composite materials and healthcare. The fundamental methodologies developed in this thesis have the potential to be applied to other advanced manufacturing systems.
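The abstract describes the Chapter 4 RG-filter only as thresholding pixels by the relative magnitude of the R and G channels. The following is a minimal sketch of that idea, not the dissertation's implementation. The function names `rg_filter` and `foreground_fraction`, the `margin` parameter, and the choice to label pixels with R greater than G as foreground are all assumptions made for illustration; the abstract does not state the direction of the comparison or any threshold value.

```python
import numpy as np


def rg_filter(image, margin=0.0):
    """Return a boolean mask that is True where R exceeds G by more than `margin`.

    Assumption: red-dominant pixels are treated as foreground. The source
    abstract does not specify the direction or the margin.
    """
    rgb = np.asarray(image)
    if rgb.ndim != 3 or rgb.shape[2] < 3:
        raise ValueError("expected an H x W x 3 RGB array")
    # Convert to float first so uint8 subtraction cannot wrap around.
    red = rgb[..., 0].astype(np.float64)
    green = rgb[..., 1].astype(np.float64)
    return (red - green) > margin


def foreground_fraction(mask):
    """Proportion of pixels labeled as foreground (hypothetical helper)."""
    return float(np.mean(mask))


# Tiny 2 x 2 example image with made-up pixel values.
example = np.array(
    [
        [[200, 100, 50], [90, 120, 60]],
        [[10, 10, 10], [255, 0, 0]],
    ],
    dtype=np.uint8,
)
mask = rg_filter(example)
fraction = foreground_fraction(mask)
```

In this sketch the mask can be summarized with `foreground_fraction` to estimate what share of a specimen surface falls in one class, which is the kind of proportion the abstract mentions for failure-mode analysis. A nonzero `margin` would leave near-gray pixels, where R and G are almost equal, in the background class.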
Date Issued
2021-08-03
Resource Type
Text
Resource Subtype
Dissertation