The likelihood of observing a set of sequences \(\sigma_{i}^n\) at terminal nodes \(n\) of a tree given topology, branch lengths, and a time reversible substitution model \(Q_{ij} = \pi_i W_{ij}\) is given by
$$L(\sigma_{i}^{n}|T, Q_{ij}) = \sum_{\sigma^m_i}\prod_i\pi_{sigma^0_i}\prod_{m\in INT}}\prod_{c\in m} P(\sigma_{i}^{c}| \sigma^{m}, t_c) $$
where the first product runs all internal nodes \(m\), the second product runs over all children of node \(m\), and the sum loops over all possible sequences of node internal nodes. The polarization of nodes into parents and children can be obtained by labeling an arbitrary point in the tree as root. Since the structure is a tree, conditioning on the state of any internal node splits the problem into two smaller subproblems. This allows a recursive calculation of the likelihood as realized by Bethe and later by phylogeneticists. To optimize the branch length, one would typically calculate derivatives of the likelihood and do some form of gradient descent. Since its talkes \(N\times L \times q\) steps to calculate the likelihood, branch length optimization should scale at least as \(N^2\times L \times q\).
Alternatively, one can approximate the problem by first reconstructing the most likely ancestral states using the same recursive algorithm as above and optimize each branch length separately for fixed ancestral sequences. This is clearly faster since it requires only on global optimization of ancestral states \(N\times L \times q\) followed by local branch length optimization. This procedure can be iterated, i.e., updated ancestral sequences after a first round of branch length optimization. One obvious problem of this procedure is a systematic underestimation of branch length since subleading states of internal nodes are ignored.
This problem can be overcome by calculating the complete marginal distribution of the length of a branch, given the rest of the tree. This distribution (for every branch) can be calculated in linear time as before.