Word factorization in $O(n^2 \log n)$ time

Given two strings $S_1, S_2$, we write $S_1S_2$ for their concatenation. Given a string $S$ and an integer $k\geq 1$, we write $(S)^k = SS\cdots S$ for the concatenation of $k$ copies of $S$. Now, given a string, we can use this notation to 'compress' it; for example, $AABAAB$ may be written as $((A)^2 B)^2$. Let us call the weight of a compression the number of characters that appear in it, so the weight of $((A)^2 B)^2$ is two, and the weight of $(AB)^2 A$ (a compression of $ABABA$) is three (separate $A$s are counted separately).

Now consider the problem of computing the 'lightest' compression of a given string $S$ with $|S|=n$. After some thought, there is an obvious dynamic programming approach that runs in $O(n^3 \log n)$ or $O(n^3)$ time, depending on the exact approach.
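As a rough illustration of the "obvious" dynamic program mentioned above, here is a minimal Python sketch. The function name `lightest_weight` and the interval table `dp` are assumptions made for this example and do not come from the contest analysis. Each substring is either split in two, or recognised as a power of one of its prefixes, in which case it costs as much as that prefix.

```python
def lightest_weight(s: str) -> int:
    """Weight of the lightest compression of s (naive interval DP sketch)."""
    n = len(s)
    if n == 0:
        return 0
    # dp[l][r] = minimum weight of a compression of s[l..r] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for length in range(1, n + 1):
        for l in range(0, n - length + 1):
            r = l + length - 1
            if length == 1:
                dp[l][r] = 1
                continue
            # Option 1: concatenate compressions of two shorter pieces.
            best = min(dp[l][k] + dp[k + 1][r] for k in range(l, r))
            # Option 2: s[l..r] is (prefix of length p)^(length / p).
            for p in range(1, length // 2 + 1):
                if length % p == 0 and s[l:r + 1 - p] == s[l + p:r + 1]:
                    best = min(best, dp[l][l + p - 1])
            dp[l][r] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    for word in ["AABAAB", "ABABA", "ABABCABC"]:
        print(word, lightest_weight(word))
```

The periodicity check is done by direct string comparison, so this sketch is roughly cubic with an extra factor for the divisors tried. It is meant only to pin down the problem statement, not to reach the pseudo-quadratic bound.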

However, I have been told that this problem can be solved in $O(n^2 \log n)$ time, though I cannot find any source on how to do this. Specifically, this problem was given in a recent programming contest (problem K here, last two pages). During the analysis an $O(n^3 \log n)$ algorithm was presented, and at the end the pseudo-quadratic bound was mentioned (here, at the four-minute mark). Unfortunately the presenter only referred to 'a complicated word combinatorics lemma', so now I have come here to ask for the solution :-)

dynamic-programming word-combinatorics Timon Knigge

Just a random property: if for a string $S$ we have $S=X^a=Y^b$, then it must also be that $S=Z^{|S|/\gcd(|X|, |Y|)}$ [I corrected an error here], with $Z$ having length $\gcd(|X|, |Y|)$ (which can be no longer than $X$ or $Y$). Not sure how useful this is. If you have already discovered that $S=X^a$ and know that $S$ contains at least 2 distinct characters, and are now looking for a shorter $Y$ such that $S=Y^b$, then you only need to try the prefixes $Y$ of $X$ whose length divides $|X|$.

j_random_hacker
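The property in the comment above can be checked on small cases with a short Python sketch. The helper names `is_power_of_prefix` and `primitive_root` are hypothetical and chosen only for this illustration. The sketch tests whether a string is a whole number of copies of a prefix, and finds the shortest such prefix by brute force.

```python
from math import gcd


def is_power_of_prefix(s: str, p: int) -> bool:
    """True if s consists of a whole number of copies of its length-p prefix."""
    return p > 0 and len(s) % p == 0 and s == s[:p] * (len(s) // p)


def primitive_root(s: str) -> str:
    """Shortest prefix Z of s such that s is a power of Z (brute force)."""
    for p in range(1, len(s) + 1):
        if is_power_of_prefix(s, p):
            return s[:p]
    return s  # only reached for the empty string


if __name__ == "__main__":
    s = "AB" * 6                      # |S| = 12
    x_len, y_len = 6, 4               # S = X^2 = Y^3
    print(is_power_of_prefix(s, x_len), is_power_of_prefix(s, y_len))
    # The comment's claim: S is then also a power of its prefix of length gcd(|X|, |Y|).
    print(is_power_of_prefix(s, gcd(x_len, y_len)))
    print(primitive_root(s))
```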

The problem is that even after reducing every possible $X^a$, you still need to aggregate the answer with a cubic DP over subsegments (i.e. $DP[l, r] = \min_k DP[l, k] + DP[k+1, r]$), so there is still some extra work to be done after that...

Timon Knigge

I see what you mean. I think you need some kind of dominance relation that eliminates some values of $k$ from needing to be tested, but I haven't been able to think of one. In particular, I considered the following: suppose $S[1..i]$ has an optimal factorisation $S[1..i] = XY^k$ with $k>1$; is it possible that there is an optimal solution in which $S$ is factorised as $XY^jZ$ with $j<k$? Unfortunately the answer is yes: for $S=ABABCABC$, $S[1..4]$ has optimal factorisation $(AB)^2$, but the unique optimal factorisation for $S$ is $AB(ABC)^2$.

j_random_hacker

If I am not misunderstanding you, I think the minimum cost factorization can be calculated in $O(n^2)$ time as follows.

For each index $i$, we will compute a bunch of values $(p_i^\ell, r_i^\ell)$ for $\ell=1,2,\ldots$ as follows. Let $p_i^1\ge 1$ be the smallest integer such that there exists an integer $r\ge 2$ satisfying

$S[i-rp_i^1+1, i-p_i^1] = S[i-(r-1)p_i^1+1, i].$

For this particular $p_i^1$, let $r_i^1$ be the largest $r$ with this property. If no such $p_i^1$ exists, set $L_i=0$ so we know there are zero $(p_i^\ell,r_i^\ell)$ values for this index.

Let $p_i^2$ be the smallest integer strictly bigger than $(r_i^1-1)p_i^1$ satisfying, likewise,

$S[i-r_i^2p_i^2+1, i-p_i^2] = S[i-(r_i^2-1)p_i^2+1, i]$

for some $r_i^2\ge 2$. Like before, take $r_i^2$ to be the maximal one having fixed $p_i^2$. In general $p_i^\ell$ is the smallest such number strictly bigger than $(r_i^{\ell-1}-1)p_i^{\ell-1}$. If no such $p_i^\ell$ exists, then $L_i=\ell-1$.

Note that for each index $i$, we have $L_i=O(\log (i+1))$ because the $p_i^\ell$ values increase geometrically with $\ell$: if $p_i^{\ell+1}$ exists, it is not just strictly bigger than $(r_i^\ell-1)p_i^\ell$ but bigger than that by at least $p_i^\ell/2$, which establishes the geometric increase.
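To make the definition of the $(p_i^\ell, r_i^\ell)$ values concrete, here is a minimal brute-force Python sketch that follows the definition literally for a single 1-based index $i$. The function name `run_values` is an assumption for this illustration. It is not the suffix-tree computation the answer alludes to, and it makes no attempt at the stated running time.

```python
def run_values(s: str, i: int):
    """List of (p, r) pairs for the 1-based index i, straight from the definition.

    A pair (p, r) is recorded when the r blocks of length p ending at
    position i are all equal, r >= 2 is maximal for that p, and p is
    strictly bigger than (r_prev - 1) * p_prev for the previous pair.
    """
    out = []
    lower = 0          # p must be strictly greater than this bound
    p = 1
    while 2 * p <= i:  # need room for at least two blocks of length p
        if p > lower:
            last = s[i - p:i]          # the block S[i-p+1..i] (1-based)
            r = 1
            # Count how many consecutive blocks ending at i equal `last`.
            while (r + 1) * p <= i and s[i - (r + 1) * p:i - r * p] == last:
                r += 1
            if r >= 2:
                out.append((p, r))
                lower = (r - 1) * p
        p += 1
    return out


if __name__ == "__main__":
    print(run_values("ABABCABC", 8))
    print(run_values("AABAABAA", 8))
```

The number of pairs returned for index $i$ is the quantity $L_i$ from the text.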

Suppose now all $(p_i^\ell,r_i^\ell)$ values are given to us. The minimum cost is given by the recurrence

$\mathrm{dp}(i,j) = \min\left\{\mathrm{dp}(i, j-1) + 1, \min_\ell \left(\mathrm{dp}\left(i,j - r_j^\ell p_j^\ell\right) + \mathrm{dp}(j-r_j^\ell p_j^\ell+1,j-p_j^\ell)\right)\right\}$

with the understanding that for $i>j$ we set $\mathrm{dp}(i,j) = +\infty$. The table can be filled in $O(n^2 + n\sum_j L_j)$ time.

We already observed above that $\sum_j L_j = O(\sum_j \log (j+1)) = \Theta(n\log n)$ by bounding the sum term by term. But actually if we look at the whole sum, we can prove something sharper.

Consider the suffix tree $T(\overleftarrow{S})$ of the reverse of $S$ (i.e., the prefix tree of S). We will charge each contribution to the sum $\sum_i L_i$ to an edge of $T(\overleftarrow{S})$ so that each edge will be charged at most once. Charge each $p_i^j$ to the edge emanating from $\mathrm{nca}(v(i), v(i-p_i^j))$ and going towards $v(i-p_i^j)$ . Here $v(i)$ is the leaf of the prefix tree corresponding to $S[1..i]$ and nca denotes the nearest common ancestor.

This shows that $\sum_i L_i=O(n)$. The values $(p_i^j,r_i^j)$ can be calculated in time $O(n+\sum_i L_i)$ by a traversal of the suffix tree, but I will leave the details to a later edit if anyone is interested.

Let me know if this makes sense.

Mert Sağlam

Let S be your initial string of length n. Here is pseudo-code for the method.

next_end_bracket = n
for i in [0:n]: # main loop

    break if i >= length(S) # due to compression
    w = (next_end_bracket - i)# width to analyse

    for j in [w/2:0:-1]: # period loop, look for largest period first
        for r in [1:n]: # number of repetition loop
            if i+j*(r+1) > w:
                break r loop

            for k in [0:j-i]:
                # compare term to term and break at first difference
                if S[i+k] != S[i+r*j+k]:
                    break r loop

        if r > 1:
            # compress
            replace S[i:i+j*(r+1)] with ( S[i:i+j] )^r
            # don't forget to record end bracket...
            # and reduce w for the i-run, carrying on the j-loop for eventual smaller periods. 
            w = j-i

I intentionally gave few details on the "end brackets", as handling them needs many stack push and pop steps that would obscure the core method. The idea is to test for a possible further contraction inside the first one, for example ABCBCABCBC => (ABCBC)² => (A(BC)²)².

So the main point is to look for large periods first. Note that S[i] is the ith term of S, skipping any "(", ")" or power.

i-loop is O(n)
j-loop is O(n)
r+k-loops are O(log(n)) as they stop at the first difference

This is globally O(n²log(n)).

Optidad

It's not clear to me that the r and k loops are O(log n) -- even separately. What ensures that a difference is found after at most O(log n) iterations?

j_random_hacker

Do I understand correctly that you are compressing greedily? Because that is incorrect, consider e.g. ABABCCCABCCC which you should factorize as AB(ABC^3)^2.

Timon Knigge

Yeah, you are totally right about that; I have to think about this.

Optidad
