I am trying to build a keyword spotting (KWS) system, and I chose this, a branch of Sphinx 4, as the foundation of my project.
It works properly with WAV files (at least 70% accuracy with a single keyword). To save time when transmitting files from the client to the server, I decided to convert the WAV file to cepstra on the client side first and then transmit only the cepstra. This conversion is performed by FeatureFileDumper.
But when I load the cepstra into the original KWS system, the accuracy is horrible. I expected that moving part of the work to the client would not affect the accuracy this much. With WAV input, the original KWS system splits the audio into proper word blocks and then recognizes them. With cepstrum input, the system cannot even split the words properly, which I think is also why it cannot achieve high accuracy.
I want to find a way to save transmission time and still get reasonable accuracy from the KWS system. Did I miss something in the configuration, or is there another way to meet this need?
Here is the configuration on the client side:
<config>
<!-- ******************************************************** -->
<!-- The frontend configuration -->
<!-- ******************************************************** -->
<component name="cepstraFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
<propertylist name="pipeline">
<item>streamDataSource</item>
<item>preemphasizer</item>
<item>windower</item>
<item>fft</item>
<item>melFilterBank</item>
<item>dct</item>
</propertylist>
</component>
<component name="preemphasizer"
type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>
<component name="windower"
type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower">
</component>
<component name="fft"
type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform"/>
<component name="melFilterBank"
type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank">
</component>
<component name="dct"
type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform"/>
<component name="streamDataSource"
type="edu.cmu.sphinx.frontend.util.StreamDataSource">
<property name="sampleRate" value="16000"/>
</component>
</config>
Here is the configuration on the server side:
<config>
<property name="logLevel" value="WARNING" />
<property name="absoluteBeamWidth" value="-1" />
<property name="relativeBeamWidth" value="1E-150" />
<property name="wordInsertionProbability" value="0.7" />
<property name="languageWeight" value="7" />
<property name="frontend" value="epFrontEnd" />
<property name="recognizer" value="recognizer" />
<property name="showCreations" value="false" />
<property name="outOfGrammarProbability" value="1E-10"/>
<property name="phoneInsertionProbability" value="1E-55"/>
<component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
<property name="decoder" value="decoder" />
</component>
<component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
<property name="searchManager" value="searchManager" />
</component>
<component name="searchManager" type="edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager">
<property name="logMath" value="logMath" />
<property name="linguist" value="FlatLinguist" />
<property name="pruner" value="trivialPruner" />
<property name="scorer" value="threadedScorer" />
<property name="activeListFactory" value="activeList" />
</component>
<component name="activeList" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
<property name="logMath" value="logMath" />
<property name="absoluteBeamWidth" value="${absoluteBeamWidth}" />
<property name="relativeBeamWidth" value="${relativeBeamWidth}" />
</component>
<component name="trivialPruner" type="edu.cmu.sphinx.decoder.pruner.SimplePruner" />
<component name="threadedScorer" type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
<property name="frontend" value="${frontend}" />
</component>
<component name="FlatLinguist" type="edu.cmu.sphinx.linguist.KWSFlatLinguist.KWSFlatLinguist">
<property name="logMath" value="logMath" />
<property name="grammar" value="NoSkipGrammar" />
<property name="acousticModel" value="wsj" />
<property name="wordInsertionProbability" value="${wordInsertionProbability}" />
<property name="languageWeight" value="${languageWeight}" />
<property name="unitManager" value="unitManager" />
<property name="addOutOfGrammarBranch" value="true"/>
<property name="phoneLoopAcousticModel" value="WSJ"/>
<property name="outOfGrammarProbability" value="${outOfGrammarProbability}"/>
<property name="phoneInsertionProbability" value="${phoneInsertionProbability}"/>
<property name="dumpGStates" value ="true"/>
</component>
<component name="NoSkipGrammar" type="edu.cmu.sphinx.linguist.language.grammar.NoSkipGrammar">
<property name="dictionary" value="dictionary" />
<property name="logMath" value="logMath" />
<property name="addSilenceWords" value="false" />
</component>
<component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.AllWordDictionary">
<property name="dictionaryPath" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d" />
<property name="fillerPath" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/noisedict" />
<property name="addSilEndingPronunciation" value="false" />
<property name="wordReplacement" value="&lt;sil&gt;" />
<property name="unitManager" value="unitManager" />
</component>
<component name="wsj" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
<property name="loader" value="wsjLoader" />
<property name="unitManager" value="unitManager" />
</component>
<component name="wsjLoader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
<property name="logMath" value="logMath" />
<property name="unitManager" value="unitManager" />
<property name="location" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz" />
</component>
<component name="unitManager" type="edu.cmu.sphinx.linguist.acoustic.UnitManager" />
<!-- additions start-->
<component name="WSJ" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
<property name="loader" value="WSJLOADER" />
<property name="unitManager" value="UNITMANAGER" />
</component>
<component name="WSJLOADER" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
<property name="logMath" value="logMath" />
<property name="unitManager" value="UNITMANAGER" />
<property name="location" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz" />
</component>
<component name="UNITMANAGER" type="edu.cmu.sphinx.linguist.acoustic.UnitManager" />
<component name="tidigits"
type="edu.cmu.sphinx.model.acoustic.TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model">
<property name="loader" value="sphinx3Loader"/>
<property name="unitManager" value="unitManager"/>
</component>
<component name="sphinx3Loader"
type="edu.cmu.sphinx.model.acoustic.TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz.ModelLoader">
<property name="logMath" value="logMath"/>
<property name="unitManager" value="UNITMANAGER"/>
</component>
<!-- additions end -->
<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
<propertylist name="pipeline">
<item>streamCepstrumSource </item>
<item>BatchCMN </item>
<item>featureExtraction </item>
</propertylist>
</component>
<component name="audioFileDataSource" type="edu.cmu.sphinx.frontend.util.AudioFileDataSource" />
<component name="dataBlocker" type="edu.cmu.sphinx.frontend.DataBlocker" />
<component name="speechClassifier" type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier" />
<component name="nonSpeechDataFilter" type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter" />
<component name="speechMarker" type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker" />
<component name="preemphasizer" type="edu.cmu.sphinx.frontend.filter.Preemphasizer" />
<component name="windower" type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower">
</component>
<component name="fft" type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform">
</component>
<component name="melFilterBank" type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank">
</component>
<component name="dct" type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform" />
<component name="liveCMN" type="edu.cmu.sphinx.frontend.feature.LiveCMN" />
<component name="featureExtraction" type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor" />
<component name="BatchCMN" type="edu.cmu.sphinx.frontend.feature.BatchCMN"/>
<component name="logMath" type="edu.cmu.sphinx.util.LogMath">
<property name="logBase" value="1.0001" />
<property name="useAddTable" value="true" />
</component>
<component name="streamCepstrumSource" type="edu.cmu.sphinx.frontend.util.StreamCepstrumSource">
<property name="sampleRate" value="16000"/>
</component>
</config>
==================================================================
Thanks to Nikolay. I've figured out that the cause is the different components used to read the files (StreamDataSource and AudioFileDataSource).
But there is a problem: my client runs on Android, which doesn't support the javax.sound.sampled package, so it is impossible to use AudioFileDataSource on my client. StreamDataSource is a possible solution, but I have no idea why these two components lead to different feature sets.
Is there any hint on how to make StreamDataSource generate the same result as AudioFileDataSource does?
Is there anything I missed in configuration or there is another way to satisfy the need?
The configuration is correct; you didn't miss anything.
Most likely you made a mistake in the transfer function, which you wrote yourself. First try transferring the data as files to make sure everything is the same on both sides. You can also dump the values produced by the cepstrum data source (streamCepstrumSource in your configuration) to verify that they are in the expected range. You can use the DataDumper component for that.
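As a rough sketch of the DataDumper suggestion, the server-side front end could be changed along these lines. The component name dataDumper is an arbitrary name chosen here for illustration, and placing it directly after streamCepstrumSource is an assumption about where the values are most useful to inspect. It relies on edu.cmu.sphinx.frontend.util.DataDumper with its default settings.

```xml
<!-- Sketch only: "dataDumper" is a hypothetical component name, not part of the original config -->
<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>streamCepstrumSource</item>
        <!-- dumps each Data object as it leaves the cepstrum source -->
        <item>dataDumper</item>
        <item>BatchCMN</item>
        <item>featureExtraction</item>
    </propertylist>
</component>

<component name="dataDumper" type="edu.cmu.sphinx.frontend.util.DataDumper"/>
```

Adding the same component after dct in the client-side pipeline would let you compare the cepstra before and after the transfer; if the dumped values differ, the problem is in the transfer rather than in the recognizer configuration.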