Non-Concave emission function: Viterbi on Hidden Markov Model?

Hi! I recently met a problem when I am using a non-concave emission function in Hidden Markov Model which is not allowed in the EM algorithm. I wonder whether Viterbi can avoid such a problem, allowing me to use a non-concave emission function. :thinking: Or some other more “general” methods?

Can you elaborate a little bit more about your model? I am not too familiar with HMMs so I do not quite understand what a non-concave emission would mean in that context. Do you mean that the function that you obtain in the E step of the EM-algorithm is not concave/convex so finding the next parameter iterate is not guaranteed with a standard convex optimization algorithm?

Agreed with Stefan it would be great to have more details about what you mean by non-concave :slight_smile:

Do you mean that the CDF of the emission distribution is non-concave, i.e. that the emission distribution function has several modes ?
Or that the emission distribution is non-Gaussian and not well modeled by other known distributions ?
In any case I can suggest you a work we have done some years ago about HMM with normalizing flow based emission models (so that it can model arbitrary distributions) and trained with a combination of EM and back-prop:

1 Like

Thank you all!
@StefanHeyder I mean the observation/emission function f=P(output|latent state, some other parameters). Calculating derivative of function f with respect to “some other parameters” to derive Hessian matrix. Which is not The Hessian matrix indicates the function is not concave. I guess your interpretation is as same as my ambiguous question :melting_face: .
@antoinehonore I currently try to use stochastic gradient descent rather than EM. I will have a look and try your method!

@disadone Depending on your model this may not be a problem: the non-convexity / non-concavity in the M-step (messed that up in my last post :slight_smile:) is problematic if you need to use a numerical algorithm to find the next parameter iterate, but sometimes this can be done analytically. If this is not possible you can use other methods from non-convex optimization, as you mentioned SGD can be a good alternative, especially if the dimension of your parameter is big. Otherwise conventional methods from non-convex optimization might also help. See e.g. scipy.optimize.minimize — SciPy v1.10.0 Manual for the available solvers in scipy.optimize.

Can you elaborate a little bit more about your observation distribution and associated parameters?

@StefanHeyder Thanks for your concern!

My function is about 3 observation states

\frac{1}{1+\exp(\beta _0+\beta _1 x)}
(1-\frac{1}{1+\exp (\beta _0+\beta _1 x )})(\frac{1}{1+\exp(-\beta _2 x )})
(1-\frac{1}{1+\exp (\beta _0+\beta _1 x )}) (1-\frac{1}{1+\exp (-\beta _2 x )})

for latent state 1. There are 2 latent states and for latent 2, there is another set of betas.

My optimization succeeded for now using SGD (actually, I use Julia :slightly_smiling_face:), but I am not sure of the result :melting_face:. I see that the Book (PROBML) section 29.4.3 says " 39. Fitting HMMs using maximum likelihood is difficult, because the log likelihood is not convex. Thus there are many local optima, and EM and SGD can give poor results."

I am trying spectral methods as the book recommends. Dynamax seems not implement the method. Hopefully, it can be implemented one day.

FYI Latex formatting is available. I expect well make heavy use of this in this book club

x = \frac{y}{z}

Just realizing that the spectral method is not suitable for any observation functions where the observation depends outside the latent variables :melting_face:.

Just a few thoughts regarding you problem:

  • The dimension of your problem is relatively low (2 sets of 3 betas = 6 parameters) so fitting with EM should be fast. This will allow you to use multiple starting points for the optimization so you may be able to discard some local optima (still you can never be sure that you did not miss the global optimum).
  • I am not sure that there is an advantage of using SGD for your problem, as far as I know its advantage would lie in faster computation of the gradient but unless you have lots of observations I doubt that there is an advantage over conventional gradient-based optimization routines.
  • I am not sure what the spectral method would refer to in this context - do you have a link?

I think the gist of the problem is that its a moderately high-dimensional optimization problem. WIth such problems, if they are non-convex, finding the global optimum is usually not guaranteed unless you can do some clever analytical manipulation. You can always do sanity checks (look at gradient / hessian at the suspected optimum, use multiple starting points, …) but guarantees are hard to give.

Though SGD is not guaranteed global optimum, EM is worse because the (observation) loglikelihood increases as the iterations go on.

About the spectral method:

The book
Section 29.4.3

The spectral method I tried

Enjoy :slightly_smiling_face:

1 Like