Title: Deep Factorized and Variational Learning for Source Separation
Authors: Kuo, Kuan-Ting; Chien, Jen-Tzung
Department: Electrical Engineering
Keywords: deep learning; source separation; dereverberation; neural network; matrix factorization; variational auto-encoder
Issue Date: 2016
Abstract:
Deep learning has achieved high performance in applications ranging from image recognition to speech recognition and source separation. Learning based on deep neural networks (DNNs) has become a new trend for solving different classification and regression problems. The feedforward DNN has also been extended to the recurrent neural network (RNN), which can learn temporal patterns from sequential data. This thesis conducts supervised regression learning for monaural source separation and proposes two deep models: the matrix factorized neural network (MFNN) and the variational recurrent neural network (VRNN). In general, deep single-channel source separation estimates the individual source signals from a set of mixed signals, where each mixed signal, represented by a temporal-frequency spectral matrix, is de-mixed by a DNN or RNN. This learning model is treated as a supervised regression network, trained from a set of mixed-signal matrices and their corresponding de-mixed matrices and then applied to separate an unknown mixed signal.

Conventionally, the input matrices are flattened into vectors for deep learning. The correlations between neighboring time frames and frequency bins are then lost, which constrains the separation performance of such a vector-based deep model. This study presents the MFNN to deal with this constraint. The model preserves the structural information of the temporal-frequency matrix in the DNN forward computation and extracts the latent factors along the temporal and frequency horizons in each layer via Tucker decomposition. The factorization and neural parameters are jointly trained under a unified objective, and a matrix error backpropagation algorithm is formulated to train the factor matrices in different layers. To reduce the redundancy in the learned representation, an orthogonality constraint is further imposed on the optimization procedure.

On the other hand, we present a new stochastic learning machine for RNN source separation. The proposed VRNN is constructed from the perspectives of the generative stochastic network and the variational auto-encoder. The VRNN incorporates the hidden state at each time step into variational Bayesian inference. The inference procedure, however, involves an intractable posterior and an intractable expectation. An inference network that parameterizes the variational distribution is therefore trained from a set of mixed signals and their individual source targets. A novel supervised VRNN is proposed through a combination of generative and discriminative training. The neural parameters under this latent variable model are estimated by maximizing the variational lower bound of the marginal likelihood. Beyond the traditional RNN, the proposed VRNN provides a stochastic point of view which accommodates the uncertainty in the hidden states and facilitates the analysis of the RNN topology. A masking function is further applied to the network outputs of the separation models.

In the experiments, we evaluate the proposed MFNN and VRNN on speech separation and dereverberation tasks using the TIMIT database and the 2014 REVERB Challenge dataset, respectively.
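The following sketch illustrates the matrix-factorized layer described above: a minimal NumPy example, under assumed shapes and names (matrix_layer, U, V, the ReLU activation), of a Tucker-2-style bilinear map that keeps the time-frequency matrix intact, together with the matrix-form gradients and the orthogonality penalty mentioned in the abstract. This is an illustrative sketch, not the thesis' exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

T, F = 32, 64        # time frames x frequency bins in the input patch
T2, F2 = 16, 32      # sizes of the latent temporal / frequency factors

X = rng.standard_normal((T, F))    # mixed-signal spectral matrix, kept as a matrix
U = rng.standard_normal((T, T2))   # temporal factor matrix
V = rng.standard_normal((F, F2))   # frequency factor matrix
B = np.zeros((T2, F2))             # matrix-valued bias

def matrix_layer(X, U, V, B):
    """Forward pass of one layer: H = relu(U^T X V + B), so the hidden
    representation stays a time-frequency matrix instead of a flat vector."""
    return np.maximum(0.0, U.T @ X @ V + B)

def matrix_layer_grads(X, U, V, G):
    """Matrix-form backpropagation for A = U^T X V + B, given G = dL/dA
    (G is assumed to already include the activation derivative)."""
    dU = X @ V @ G.T   # dL/dU: (T,F)(F,F2)(F2,T2) -> (T,T2)
    dV = X.T @ U @ G   # dL/dV: (F,T)(T,T2)(T2,F2) -> (F,F2)
    dX = U @ G @ V.T   # dL/dX, propagated to the previous layer
    return dU, dV, dX

def orth_penalty(U):
    """Orthogonality penalty ||U^T U - I||_F^2 on a factor basis, of the kind
    the abstract imposes to reduce redundancy in the representation."""
    Q = U.T @ U
    return np.sum((Q - np.eye(Q.shape[0])) ** 2)

H = matrix_layer(X, U, V, B)
print(H.shape)            # (16, 32): still a time-frequency matrix
print(orth_penalty(U))    # scalar penalty added to the training objective
```

The gradient identities dL/dU = X V G^T, dL/dV = X^T U G, and dL/dX = U G V^T follow directly from differentiating the bilinear form, which is what lets backpropagation proceed entirely in matrix form without flattening X.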
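The training objective of the VRNN can be sketched in the same spirit. The per-step variational lower bound below assumes a diagonal-Gaussian inference network and prior and a unit-variance Gaussian decoder; these parameterizations and all function names are illustrative assumptions, while the overall structure (a reparameterized sample, a reconstruction term for the clean source target, minus a KL to the state-conditioned prior) follows the abstract.

```python
import numpy as np

def gauss_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL divergence between two diagonal Gaussians q and p."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def step_elbo(y_t, mu_q, logvar_q, mu_p, logvar_p, decode, rng):
    """One-sample Monte Carlo estimate of the lower bound at a single time
    step: E_q[log p(y_t | z_t)] - KL(q(z_t | x_t, h_{t-1}) || p(z_t | h_{t-1}))."""
    eps = rng.standard_normal(mu_q.shape)
    z_t = mu_q + np.exp(0.5 * logvar_q) * eps     # reparameterization trick
    mu_y = decode(z_t)                            # decoder mean for the source target
    log_lik = -0.5 * np.sum((y_t - mu_y) ** 2)    # unit-variance Gaussian likelihood
    return log_lik - gauss_kl(mu_q, logvar_q, mu_p, logvar_p)

rng = np.random.default_rng(1)
d_z, d_y = 8, 64
W = rng.standard_normal((d_y, d_z)) * 0.1         # toy linear decoder
elbo = step_elbo(
    y_t=rng.standard_normal(d_y),
    mu_q=rng.standard_normal(d_z), logvar_q=np.zeros(d_z),
    mu_p=np.zeros(d_z), logvar_p=np.zeros(d_z),
    decode=lambda z: W @ z, rng=rng,
)
print(elbo)   # summed over time steps, this is the bound being maximized
```

In the supervised setting described above, the target y_t is the clean source spectrum paired with the mixed input x_t, which is what makes the generative bound double as a discriminative training signal.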
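Finally, the masking function applied to the network outputs is commonly realized as a Wiener-like soft mask; the abstract does not spell out its exact form, so the example below is an assumed, minimal realization.

```python
import numpy as np

def soft_mask(est1, est2, mixture, eps=1e-8):
    """Wiener-like soft mask: rescale the two magnitude estimates so that
    the separated outputs exactly partition the mixture spectrogram."""
    mask1 = est1 / (est1 + est2 + eps)
    return mask1 * mixture, (1.0 - mask1) * mixture

rng = np.random.default_rng(2)
S1, S2 = rng.random((10, 5)), rng.random((10, 5))   # network output magnitudes
mix = S1 + S2                                        # toy mixture magnitude
y1, y2 = soft_mask(S1, S2, mix)
print(np.allclose(y1 + y2, mix))                     # True: masks sum to one
```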
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070350725
http://hdl.handle.net/11536/139683
Appears in Collections: Thesis