Alguma maneira mais rápida de verificar se as listas em uma lista são equivalentes?

9

Aqui eu tenho números inteiros 1:7para quatro partições diferentes, ou seja, {1}, {2,3,4}, {5,6} e {7} e essas partições são escritas em uma lista, ou seja list(1,c(2,3,4),c(5,6),7),. Trato as partições como conjuntos, de modo que diferentes permutações de elementos em uma partição sejam reconhecidas como a mesma. Por exemplo, list(1,c(2,3,4),c(5,6),7)e list(7,1,c(2,3,4),c(6,5))são equivalentes.

Observe que, não há repetição para elementos na lista, por exemplo, não list(c(1,2),c(2,1),c(1,2)), pois esse problema está discutindo partições exclusivas em todo o conjunto.

Listei algumas das diferentes permutações na lista lstcomo abaixo

lst <- list(list(1,c(2,3,4),c(5,6),7),
            list(c(2,3,4),1,7,c(5,6)),
            list(1,c(2,3,4),7,c(6,5)),
            list(7,1,c(3,2,4),c(5,6)))

e o que eu quero fazer é verificar se todas as permutações são equivalentes. Se sim, então obtemos resultado TRUE.

O que eu fiz até agora é para classificar os elementos dentro de cada partição, e usado setdiff()com interset()e union()julgá-lo (ver o meu código abaixo)

s <- Map(function(v) Map(sort,v),lst)
equivalent <- length(setdiff(Reduce(union,s),Reduce(intersect,s),))==0

No entanto, acho que esse método seria lento sempre que o tamanho da partição aumentar. Existe alguma abordagem mais rápida para fazer isso? Apreciado com antecedência!

alguns casos de teste (dados pequenos)

# should return `TRUE`
lst1 <- list(list(1,c(2,3,4),c(5,6)),
            list(c(2,3,4),1,c(5,6)),
            list(1,c(2,3,4),c(6,5)))

# should return `TRUE`
lst2 <- list(list(1:2, 3:4), list(3:4, 1:2))

# should return `FALSE`
lst3 <- list(list(1,c(2,3,4),c(5,6)), list(c(2,3,4),1,c(5,6)), list(1,c(2,3,5),c(6,4)))

r algorithm list performance ThomasIsCoding
fonte

11

Eu acho que você pode evitar as várias Mapchamadas

akrun

11

Eu sugiro adicionar mais alguns casos de teste à sua pergunta, um com partições de tamanho igual lst_equal = list(list(1:2, 3:4), list(3:4, 1:2))e também um onde o resultado deve ser FALSE, talvezlst_false <- list(list(1,c(2,3,4),c(5,6)), list(c(2,3,4),1,c(5,6)), list(1,c(2,3,5),c(6,4)))

Gregor Thomas

3

Eu recomendo fortemente ter vários pequenos exemplos - incluindo alguns onde está o resultado esperado FALSE. Dessa forma, quando uma resposta funciona em alguns casos de teste, mas não em todos, é fácil diagnosticar o motivo. Quando há apenas um exemplo, você perde nuances nos resultados do teste. Também é bom adicionar novos exemplos em vez de alterar os exemplos existentes nas pessoas que já trabalharam neles.

Gregor Thomas

11

Quero adicionar um comentário de que sua descrição me faz pensar que você espera que o resultado seja VERDADEIRO, você está apenas verificando. Se não fosse esse o caso (por exemplo, se você acha que receberá um número significativo de FALSEs), e especialmente se o comprimento lstfor potencialmente longo, você poderá obter eficiência com outras abordagens. Por exemplo, um primeiro cheque que length(unique(lengths(lst))) == 1seria muito rapidamente voltar FALSEse alguma das listas internas tem o número errado de elementos ....

Gregor Thomas

11

Se isso for aprovado, convém analisar um item de cada vez lst, comparando-o lst[[i]]com lst[[1]], e assim você pode parar assim que encontrar uma incompatibilidade, em vez de fazer todas as comparações. Se lsté longo FALSEes são comuns, isso pode ser um grande ganho de eficiência, mas provavelmente não vale a pena.

Gregor Thomas

6

Um post sobre Re qualquer variante do fast não está completo sem uma solução com o rcpp .

Para maximizar a eficiência, escolher a estrutura de dados correta será de extrema importância. Nossa estrutura de dados precisa armazenar valores exclusivos e também ter acesso / inserção rápida. Isso é exatamente o que std :: unordered_set incorpora. Precisamos apenas determinar como podemos identificar exclusivamente cada um vectordos não-ordenados integers.

Entre no Teorema Fundamental da Aritmética

O TLC afirma que todo número pode ser representado exclusivamente (até a ordem dos fatores) pelo produto de números primos.

Aqui está um exemplo demonstrando como podemos usar o FTA para decifrar rapidamente se dois vetores são equivalentes até a ordem (NB Pabaixo está uma lista de números primos ... (2, 3, 5, 7, 11, etc.):

                   Maps to                    Maps to              product
vec1 = (1, 2, 7)    -->>    P[1], P[2], P[7]   --->>   2,  3, 17     -->>   102
vec2 = (7, 3, 1)    -->>    P[7], P[3], P[1]   --->>  17,  5,  2     -->>   170
vec3 = (2, 7, 1)    -->>    P[2], P[7], P[1]   --->>   3, 17,  2     -->>   102

A partir disso, vemos vec1e mapeamos vec3corretamente para o mesmo número, enquanto vec2são mapeados para um valor diferente.

Como nossos vetores reais podem conter até cem números inteiros menores que 1000, a aplicação do FTA produzirá números extremamente grandes. Podemos contornar isso, aproveitando a regra do produto do logaritmo:

log _b (xy) = log _b (x) + log _b (y)

Com isso à nossa disposição, poderemos lidar com exemplos de números muito maiores (isso começa a se deteriorar em exemplos extremamente grandes).

Primeiro, precisamos de um gerador simples de números primos (NB: Na verdade, estamos gerando o log de cada número primo).

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

void getNPrimes(std::vector<double> &logPrimes) {

    const int n = logPrimes.size();
    const int limit = static_cast<int>(2.0 * static_cast<double>(n) * std::log(n));
    std::vector<bool> sieve(limit + 1, true);

    int lastP = 3;
    const int fsqr = std::sqrt(static_cast<double>(limit));

    while (lastP <= fsqr) {
        for (int j = lastP * lastP; j <= limit; j += 2 * lastP)
            sieve[j] = false;

        int ind = 2;

        for (int k = lastP + 2; !sieve[k]; k += 2)
            ind += 2;

        lastP += ind;
    }

    logPrimes[0] = std::log(2.0);

    for (int i = 3, j = 1; i <= limit && j < n; i += 2)
        if (sieve[i])
            logPrimes[j++] = std::log(static_cast<double>(i));
}

E aqui está a principal implementação:

// [[Rcpp::export]]
bool f_Rcpp_Hash(List x) {

    List tempLst = x[0];
    const int n = tempLst.length();
    int myMax = 0;

    // Find the max so we know how many primes to generate
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        const int tempMax = *std::max_element(v.cbegin(), v.cend());

        if (tempMax > myMax)
            myMax = tempMax;
    }

    std::vector<double> logPrimes(myMax + 1, 0.0);
    getNPrimes(logPrimes);
    double sumMax = 0.0;

    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        double mySum = 0.0;

        for (auto j: v)
            mySum += logPrimes[j];

        if (mySum > sumMax)
            sumMax = mySum;
    }

    // Since all of the sums will be double values and we want to
    // ensure that they are compared with scrutiny, we multiply
    // each sum by a very large integer to bring the decimals to
    // the right of the zero and then convert them to an integer.
    // E.g. Using the example above v1 = (1, 2, 7) & v2 = (7, 3, 1)
    //              
    //    sum of log of primes for v1 = log(2) + log(3) + log(17)
    //                               ~= 4.62497281328427
    //
    //    sum of log of primes for v2 = log(17) + log(5) + log(2)
    //                               ~= 5.13579843705026
    //    
    //    multiplier = floor(.Machine$integer.max / 5.13579843705026)
    //    [1] 418140173
    //    
    // Now, we multiply each sum and convert to an integer
    //    
    //    as.integer(4.62497281328427 * 418140173)
    //    [1] 1933886932    <<--   This is the key for v1
    //
    //    as.integer(5.13579843705026 * 418140173)
    //    [1] 2147483646    <<--   This is the key for v2

    const uint64_t multiplier = std::numeric_limits<int>::max() / sumMax;
    std::unordered_set<uint64_t> canon;
    canon.reserve(n);

    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        double mySum = 0.0;

        for (auto j: v)
            mySum += logPrimes[j];

        canon.insert(static_cast<uint64_t>(multiplier * mySum));
    }

    const auto myEnd = canon.end();

    for (auto it = x.begin() + 1; it != x.end(); ++it) {
        List tempLst = *it;

        if (tempLst.length() != n)
            return false;

        for (int j = 0; j < n; ++j) {
            IntegerVector v = tempLst[j];
            double mySum = 0.0;

            for (auto k: v)
                mySum += logPrimes[k];

            const uint64_t key = static_cast<uint64_t>(multiplier * mySum);

            if (canon.find(key) == myEnd)
                return false;
        }
    }

    return true;
}

Aqui estão os resultados, quando aplicados lst1, lst2, lst3, & lst (the large one)pelo @GKi.

f_Rcpp_Hash(lst)
[1] TRUE

f_Rcpp_Hash(lst1)
[1] TRUE

f_Rcpp_Hash(lst2)
[1] FALSE

f_Rcpp_Hash(lst3)
[1] FALSE

E aqui estão alguns benchmarks com o unitsconjunto de parâmetros para relative.

microbenchmark(check = 'equal', times = 10
               , unit = "relative"
               , f_ThomsIsCoding(lst3)
               , f_chinsoon12(lst3)
               , f_GKi_6a(lst3)
               , f_GKi_6b(lst3)
               , f_Rcpp_Hash(lst3))
Unit: relative
                 expr       min        lq      mean    median        uq       max neval
f_ThomsIsCoding(lst3) 84.882393 63.541468 55.741646 57.894564 56.732118 33.142979    10
   f_chinsoon12(lst3) 31.984571 24.320220 22.148787 22.393368 23.599284 15.211029    10
       f_GKi_6a(lst3)  7.207269  5.978577  5.431342  5.761809  5.852944  3.439283    10
       f_GKi_6b(lst3)  7.399280  5.751190  6.350720  5.484894  5.893290  8.035091    10
    f_Rcpp_Hash(lst3)  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    10


microbenchmark(check = 'equal', times = 10
               , unit = "relative"
               , f_ThomsIsCoding(lst)
               , f_chinsoon12(lst)
               , f_GKi_6a(lst)
               , f_GKi_6b(lst)
               , f_Rcpp_Hash(lst))
Unit: relative
                expr        min         lq       mean     median        uq       max neval
f_ThomsIsCoding(lst) 199.776328 202.318938 142.909407 209.422530 91.753335 85.090838    10
   f_chinsoon12(lst)   9.542780   8.983248   6.755171   9.766027  4.903246  3.834358    10
       f_GKi_6a(lst)   3.169508   3.158366   2.555443   3.731292  1.902140  1.649982    10
       f_GKi_6b(lst)   2.992992   2.943981   2.019393   3.046393  1.315166  1.069585    10
    f_Rcpp_Hash(lst)   1.000000   1.000000   1.000000   1.000000  1.000000  1.000000    10

Aproximadamente 3x mais rápido que a solução mais rápida já vista no exemplo maior.

O que isto significa?

Para mim, esse resultado base Rmostra muito a beleza e a eficiência exibidas por @GKi, @ chinsoon12, @Gregor, @ThomasIsCoding e muito mais. Escrevemos cerca de 100 linhas de muito específicas C++para obter uma velocidade moderada. Para ser justo, as base Rsoluções acabam chamando principalmente o código compilado e acabam utilizando tabelas de hash, como fizemos acima.

Joseph Wood
fonte

11

@ThomasIsCoding, estou honrado por você ter escolhido minha resposta, mas acredito sinceramente que as outras respostas são melhores.

Joseph Wood

11

Muito obrigado pela sua contribuição! Seu trabalho é excelente!

214198 Thomas'sCoding

5

Após a classificação, você pode usar duplicatede all.

s <- lapply(lst, function(x) lapply(x, sort)) #Sort vectors
s <- lapply(s, function(x) x[order(vapply(x, "[", 1, 1))]) #Sort lists
all(duplicated(s)[-1]) #Test if there are all identical
#length(unique(s)) == 1 #Alternative way to test if all are identical

Alternativa: classifique em um loop

s <- lapply(lst, function(x) {
  tt <- lapply(x, sort)
  tt[order(vapply(tt, "[", 1, 1))]
})
all(duplicated(s)[-1])

Alternativa: classifique durante o loop e permita a saída antecipada

s <- lapply(lst[[1]], sort)
s <- s[order(vapply(s, "[", 1, 1))]
tt  <- TRUE
for(i in seq(lst)[-1]) {
  x <- lapply(lst[[i]], sort)
  x <- x[order(vapply(x, "[", 1, 1))]
  if(!identical(s, x)) {
    tt  <- FALSE
    break;
  }
}
tt

ou usando setequal

s <- lapply(lst[[1]], sort)
tt  <- TRUE
for(i in seq(lst)[-1]) {
  x <- lapply(lst[[i]], sort)
  if(!setequal(s, x)) {
    tt  <- FALSE
    break;
  }
}
tt

ou melhorando um pouco a ideia de @ chinsoon12 para trocar a lista por um vetor!

s <- lst[[1]][order(vapply(lst[[1]], min, 1))]
s <- rep(seq_along(s), lengths(s))[order(unlist(s))]
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]][order(vapply(lst[[i]], min, 1))]
  x <- rep(seq_along(x), lengths(x))[order(unlist(x))]
  if(!identical(s, x)) {tt <- FALSE; break;}
}
tt

ou evite o segundo order

s <- lst[[1]][order(vapply(lst[[1]], min, 1))]
s <- rep(seq_along(s), lengths(s))[order(unlist(s))]
y <- s
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]][order(vapply(lst[[i]], min, 1))]
  y <- y[0]
  y[unlist(x)] <- rep(seq_along(x), lengths(x))
  if(!identical(s, y)) {tt <- FALSE; break;}
}
tt

ou trocar ordercom match(ou fmatch)

x <- lst[[1]]
s <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
s <- match(s, unique(s))
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]]
  y <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
  y <- match(y, unique(y))
  if(!identical(s, y)) {tt <- FALSE; break;}
}
tt

Ou sem saída antecipada.

s <- lapply(lst, function(x) {
  y <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
  match(y, unique(y))
})
all(duplicated(s)[-1])

ou escrito em C ++

sourceCpp(code = "#include <Rcpp.h>
#include <vector>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
bool f_GKi_6_Rcpp(const List &x) {
  const List &x0 = x[0];
  const unsigned int n = x0.length();
  unsigned int nn = 0;
  for (List const &i : x0) {nn += i.length();}
  std::vector<int> s(nn);
  for (unsigned int i=0; i<n; ++i) {
    const IntegerVector &v = x0[i];
    for (int const &j : v) {
      if(j > nn) return false;
      s[j-1] = i;
    }
  }
  {
    std::vector<int> lup(n, -1);
    int j = 0;
    for(int &i : s) {
      if(lup[i] < 0) {lup[i] = j++;}
      i = lup[i];
    }
  }
  for (List const &i : x) {
    if(i.length() != n) return false;
    std::vector<int> sx(nn);
    for(unsigned int j=0; j<n; ++j) {
      const IntegerVector &v = i[j];
      for (int const &k : v) {
        if(k > nn) return false;
        sx[k-1] = j;
      }
    }
    {
      std::vector<int> lup(n, -1);
      int j = 0;
      for(int &i : sx) {
        int &lupp = lup[i];
        if(lupp == -1) {lupp = j; i = j++;
        } else {i = lupp;}
      }
    }
    if(s!=sx) return false;
  }
  return true;
}
")

Obrigado a @Gregor pelas dicas para melhorar a resposta!

GKi
fonte

Eu não acho que funcionaria ao ter partições de tamanho igual ,,, mas deveria ser mais rápido que o meu quando tiver partições de tamanho desigual. Por exemplo,

lst <- list(list(1,c(2,3,4),c(5,6),7),             list(c(2,3,4),1,7,c(5,6)),             list(1,c(2,3,4),7,c(6,5)),             list(7,1,c(3,2,4),c(5,6)))

será julgado comoFALSE

ThomasIsCoding 10/12/19

11

@ Gregor Obrigado pela dica de classificação min!

GKI

Parece ótimo! Vou esperar um pouco mais para ver se há outra solução mais rápida.

ThomasIsCoding

Quais são as dimensões reais do seu conjunto de dados para você procurar uma solução mais rápida?

chinsoon12

Adicionei benchmarks de desempenho para ver a eficiência (veja meu post recém-editado). Sua solução é mais rápida que a minha, principalmente a de duas etapas. Gostaria de esperar até que qualquer solução com melhorias maiores apareça, caso contrário, a sua será aceita como a melhor. Mais uma vez obrigado!

ThomasIsCoding

4

Atuação:

library(microbenchmark)

microbenchmark(check = 'equal', times=10
  , f_ThomsIsCoding(lst1)
  , f_chinsoon12(lst1)
  , f_GKi_6a(lst1)
  , f_GKi_6b(lst1)
  , f_GKi_6_Rcpp(lst1)
  , f_Rcpp_Hash(lst1))
#Unit: microseconds
#                  expr        min         lq        mean     median         uq        max neval
# f_ThomsIsCoding(lst1) 161187.790 162453.520 167107.5739 167899.471 169441.028 174746.156    10
#    f_chinsoon12(lst1)  64380.792  64938.528  66983.9449  67357.924  68487.438  69201.032    10
#        f_GKi_6a(lst1)   8833.595   9201.744  10377.5844   9407.864  12145.926  14662.022    10
#        f_GKi_6b(lst1)   8815.592   8913.950   9877.4948   9112.924  10941.261  12553.845    10
#    f_GKi_6_Rcpp(lst1)    394.754    426.489    539.1494    439.644    451.375   1327.885    10
#     f_Rcpp_Hash(lst1)    327.665    374.409    499.4080    398.101    495.034   1198.674    10

microbenchmark(check = 'equal', times=10
  , f_ThomsIsCoding(lst2)
  , f_chinsoon12(lst2)
  , f_GKi_6a(lst2)
  , f_GKi_6b(lst2)
  , f_GKi_6_Rcpp(lst2)
  , f_Rcpp_Hash(lst2))
#Unit: microseconds
#                  expr       min        lq        mean      median         uq        max neval
# f_ThomsIsCoding(lst2) 93808.603 99663.651 103358.2039 104676.1600 107124.879 107485.696    10
#    f_chinsoon12(lst2)   131.320   147.192    192.5354    188.1935    205.053    337.062    10
#        f_GKi_6a(lst2)  8630.970  9554.279  10681.9510   9753.2670  11970.377  13489.243    10
#        f_GKi_6b(lst2)    39.736    47.916     61.3929     52.7755     63.026    110.808    10
#    f_GKi_6_Rcpp(lst2)    43.017    51.022     72.8736     76.3465     86.527    116.060    10
#     f_Rcpp_Hash(lst2)     3.667     4.237     20.5887     16.3000     18.031     96.728    10

microbenchmark(check = 'equal', times=10
  , f_ThomsIsCoding(lst3)
  , f_chinsoon12(lst3)
  , f_GKi_6a(lst3)
  , f_GKi_6b(lst3)
  , f_GKi_6_Rcpp(lst3)
  , f_Rcpp_Hash(lst3))
#Unit: microseconds
#                  expr        min         lq        mean      median         uq        max neval
# f_ThomsIsCoding(lst3) 157660.501 166914.782 167067.2512 167204.9065 168055.941 177153.694    10
#    f_chinsoon12(lst3)    139.157    181.019    183.9257    188.0950    198.249    211.860    10
#        f_GKi_6a(lst3)   9484.496   9617.471  10709.3950  10056.1865  11812.037  12830.560    10
#        f_GKi_6b(lst3)     33.583     36.338     47.1577     42.6540     63.469     66.640    10
#    f_GKi_6_Rcpp(lst3)     60.010     60.455     89.4963     94.7220    104.271    121.431    10
#     f_Rcpp_Hash(lst3)      4.404      5.518      9.9811      6.5115     17.396     20.090    10

microbenchmark(check = 'equal', times=10
  , f_ThomsIsCoding(lst4)
  , f_chinsoon12(lst4)
  , f_GKi_6a(lst4)
  , f_GKi_6b(lst4)
  , f_GKi_6_Rcpp(lst4)
  , f_Rcpp_Hash(lst4))
#Unit: milliseconds
#                  expr         min          lq       mean      median          uq        max neval
# f_ThomsIsCoding(lst4) 1874.129146 1937.643431 2012.99077 2002.460746 2134.072981 2187.46886    10
#    f_chinsoon12(lst4)   69.949917   74.393779   80.25362   76.595763   87.116571  100.57917    10
#        f_GKi_6a(lst4)   23.259178   23.328548   27.62690   28.856612   30.675259   32.57509    10
#        f_GKi_6b(lst4)   22.200969   22.326122   24.20769   23.023687   23.619360   31.74266    10
#    f_GKi_6_Rcpp(lst4)    8.062451    8.228526   10.30559    8.363314   13.425531   13.80677    10
#     f_Rcpp_Hash(lst4)    6.551370    6.586025    7.22958    6.724232    6.809745   11.97631    10

Bibliotecas:

system.time(install.packages("Rcpp"))
#       User      System verstrichen 
#     27.576       1.147      29.396 

system.time(library(Rcpp))
#       User      System verstrichen 
#      0.070       0.000       0.071

Funções:

system.time({f_ThomsIsCoding <- function(lst) {
  s <- Map(function(v) Map(sort,v),lst)
  length(setdiff(Reduce(union,s),Reduce(intersect,s)))==0
}})
#       User      System verstrichen 
#          0           0           0 

#like GKi's solution to stop early when diff is detected
system.time({f_chinsoon12  <- function(lst) {
    x <- lst[[1L]]
    y <- x[order(lengths(x), sapply(x, min))]
    a <- rep(seq_along(y), lengths(y))[order(unlist(y))]
    for(x in lst[-1L]) {
        y <- x[order(lengths(x), sapply(x, min))]
        a2 <- rep(seq_along(y), lengths(y))[order(unlist(y))]
        if(!identical(a, a2)) {
            return(FALSE)
        }
    }
    TRUE
}})
#       User      System verstrichen 
#          0           0           0 

system.time({f_GKi_6a <- function(lst) {
  all(duplicated(lapply(lst, function(x) {
    y <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
    match(y, unique(y))
  }))[-1])
}})
#      User      System verstrichen 
#          0           0           0 

system.time({f_GKi_6b <- function(lst) {
  x <- lst[[1]]
  s <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
  s <- match(s, unique(s))
  for(i in seq(lst)[-1]) {
    x <- lst[[i]]
    y <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
    y <- match(y, unique(y))
    if(!identical(s, y)) return(FALSE)
  }
  TRUE
}})
#       User      System verstrichen 
#          0           0           0 

system.time({sourceCpp(code = "#include <Rcpp.h>
#include <vector>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
bool f_GKi_6_Rcpp(const List &x) {
  const List &x0 = x[0];
  const unsigned int n = x0.length();
  unsigned int nn = 0;
  for (List const &i : x0) {nn += i.length();}
  std::vector<int> s(nn);
  for (unsigned int i=0; i<n; ++i) {
    const IntegerVector &v = x0[i];
    for (int const &j : v) {
      if(j > nn) return false;
      s[j-1] = i;
    }
  }
  {
    std::vector<int> lup(n, -1);
    int j = 0;
    for(int &i : s) {
      if(lup[i] < 0) {lup[i] = j++;}
      i = lup[i];
    }
  }
  for (List const &i : x) {
    if(i.length() != n) return false;
    std::vector<int> sx(nn);
    for(unsigned int j=0; j<n; ++j) {
      const IntegerVector &v = i[j];
      for (int const &k : v) {
        if(k > nn) return false;
        sx[k-1] = j;
      }
    }
    {
      std::vector<int> lup(n, -1);
      int j = 0;
      for(int &i : sx) {
        int &lupp = lup[i];
        if(lupp == -1) {lupp = j; i = j++;
        } else {i = lupp;}
      }
    }
    if(s!=sx) return false;
  }
  return true;
}
")})
#       User      System verstrichen 
#      3.265       0.217       3.481 

system.time({sourceCpp(code = "#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

void getNPrimes(std::vector<double> &logPrimes) {
    const int n = logPrimes.size();
    const int limit = static_cast<int>(2.0 * static_cast<double>(n) * std::log(n));
    std::vector<bool> sieve(limit + 1, true);
    int lastP = 3;
    const int fsqr = std::sqrt(static_cast<double>(limit));

    while (lastP <= fsqr) {
        for (int j = lastP * lastP; j <= limit; j += 2 * lastP)
            sieve[j] = false;
        int ind = 2;
        for (int k = lastP + 2; !sieve[k]; k += 2)
            ind += 2;
        lastP += ind;
    }
    logPrimes[0] = std::log(2.0);
    for (int i = 3, j = 1; i <= limit && j < n; i += 2)
        if (sieve[i])
            logPrimes[j++] = std::log(static_cast<double>(i));
}

// [[Rcpp::export]]
bool f_Rcpp_Hash(List x) {
    List tempLst = x[0];
    const int n = tempLst.length();
    int myMax = 0;
    // Find the max so we know how many primes to generate
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        const int tempMax = *std::max_element(v.cbegin(), v.cend());
        if (tempMax > myMax)
            myMax = tempMax;
    }
    std::vector<double> logPrimes(myMax + 1, 0.0);
    getNPrimes(logPrimes);
    double sumMax = 0.0;
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        double mySum = 0.0;
        for (auto j: v)
            mySum += logPrimes[j];
        if (mySum > sumMax)
            sumMax = mySum;
    }
    const uint64_t multiplier = std::numeric_limits<int>::max() / sumMax;
    std::unordered_set<uint64_t> canon;
    canon.reserve(n);
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        double mySum = 0.0;
        for (auto j: v)
            mySum += logPrimes[j];
        canon.insert(static_cast<uint64_t>(multiplier * mySum));
    }
    const auto myEnd = canon.end();
    for (auto it = x.begin() + 1; it != x.end(); ++it) {
        List tempLst = *it;
        if (tempLst.length() != n)
            return false;
        for (int j = 0; j < n; ++j) {
            IntegerVector v = tempLst[j];
            double mySum = 0.0;
            for (auto k: v)
                mySum += logPrimes[k];
            const uint64_t key = static_cast<uint64_t>(multiplier * mySum);
            if (canon.find(key) == myEnd)
                return false;
        }
    }
    return true;
}
")})
#       User      System verstrichen 
#      3.507       0.155       3.662

Dados:

lst1 <- list(list(1,c(2,3,4),c(5,6)) #TRUE
           , list(c(2,3,4),1,c(5,6))
           , list(1,c(2,3,4),c(6,5)))
lst2 <- list(list(c(2,3,4),c(1,5,6)) #FALSE
           , list(c(2,3,6),c(1,5,4))
           , list(c(2,3,4),c(1,5,6)))
lst3 <- list(list(1,c(2,3,4),c(5,6)) #FALSE
           , list(c(2,3,4),1,c(5,6))
           , list(1,c(2,3,5),c(6,4)))
set.seed(7)
N  <- 1e3
lst1 <- lst1[sample(seq(lst1), N, TRUE)]
lst2 <- lst2[sample(seq(lst2), N, TRUE)]
lst3 <- lst3[sample(seq(lst3), N, TRUE)]
N <- 1000
M <- 500
l <- unname(split(1:N,findInterval(1:N,sort(sample(1:N,N/10)),left.open = T)))
lst4 <- lapply(lapply(1:M, 
                     function(k) lapply(l, 
                                        function(v) v[sample(seq_along(v),length(v))])), function(x) x[sample(seq_along(x),length(x))])

GKi
fonte

Muito obrigado! Eu notei que eu cometi um erro de digitação no meu código, que deve ser length(setdiff(Reduce(union,s),Reduce(intersect,s)))==0 , desculpe pelo meu erro ....

ThomasIsCoding

@ThomasIsCoding A resposta é atualizada. Mas eu fiz como um Wiki, para que todos sejam bem-vindos para atualizar e incluir novas soluções e ter isso nem sempre repetido.

GKi

Obrigada pelos teus esforços! Acho que agora a minha solução dá os mesmos resultados que o seu após a correção, mas mais lento do que seu :)

ThomasIsCoding

Impressionante! Você melhora notavelmente o desempenho! Eu aceito sua solução!

ThomasIsCoding 12/12/19

@ chinsoon12 muito obrigado por me lembrar! Agora eu mudei para outro dele para aceitação

ThomasIsCoding

3

Espero que a segunda vez com sorte

f <- function(lst) {
    s <- lapply(lst, function(x) {
        y <- x[order(lengths(x), sapply(x, min))]
        rep(seq_along(y), lengths(y))[order(unlist(y))]
    })
    length(unique(s))==1L
}

casos de teste:

# should return `TRUE`
lst1 <- list(list(1,c(2,3,4),c(5,6)),
    list(c(2,3,4),1,c(5,6)),
    list(1,c(2,3,4),c(6,5)))

# should return `TRUE`
lst2 <- list(list(1:2, 3:4), list(3:4, 1:2))

# should return `FALSE`
lst3 <- list(list(1,c(2,3,4),c(5,6)), list(c(2,3,4),1,c(5,6)), list(1,c(2,3,5),c(6,4)))

# should return `FALSE`
lst4 <- list(list(c(2,3,4),c(1,5,6)), list(c(2,3,6),c(1,5,4)), list(c(2,3,4),c(1,5,6)))

lst5 <- list(list(1,c(2,3,4),c(5,6)) #TRUE
    , list(c(2,3,4),1,c(5,6))
    , list(1,c(2,3,4),c(6,5)))
lst6 <- list(list(c(2,3,4),c(1,5,6)) #FALSE
    , list(c(2,3,6),c(1,5,4))
    , list(c(2,3,4),c(1,5,6)))
lst7 <- list(list(1,c(2,3,4),c(5,6)) #FALSE
    , list(c(2,3,4),1,c(5,6))
    , list(1,c(2,3,5),c(6,4)))

Verificações:

f(lst1)
#[1] TRUE
f(lst2)
#[1] TRUE
f(lst3)
#[1] FALSE
f(lst4)
#[1] FALSE
f(lst5)
#[1] TRUE
f(lst6)
#[1] FALSE
f(lst7)
#[1] FALSE

código de temporização:

library(microbenchmark)
set.seed(0L)
N <- 1000
M <- 100
l <- unname(split(1:N,findInterval(1:N,sort(sample(1:N,N/10)),left.open = T)))
lst <- lapply(lapply(1:M,
    function(k) lapply(l,
        function(v) v[sample(seq_along(v),length(v))])), function(x) x[sample(seq_along(x),length(x))])

f_ThomsIsCoding <- function(lst) {
    s <- Map(function(v) Map(sort,v),lst)
    length(setdiff(Reduce(union,s),Reduce(intersect,s)))==0
}

f_GKi_1 <- function(lst) {
    all(duplicated(lapply(lst, function(x) lapply(x, sort)[order(unlist(lapply(x, min)))]))[-1])
}

f_GKi_2 <- function(lst) {
    s <- lapply(lst, function(x) lapply(x, sort))
    all(duplicated(lapply(s, function(x) x[order(unlist(lapply(x, "[", 1)))]))[-1])
}


f <- function(lst) {
    s <- lapply(lst, function(x) {
        y <- x[order(lengths(x), sapply(x, min))]
        rep(seq_along(y), lengths(y))[order(unlist(y))]
    })
    length(unique(s))==1L
}

microbenchmark(times=3L,
    f_ThomsIsCoding(lst),
    f_GKi_1(lst),
    f_GKi_2(lst),
    f(lst)
)

horários:

Unit: milliseconds
                 expr       min        lq      mean    median        uq      max neval
 f_ThomsIsCoding(lst) 333.77313 334.61662 348.37474 335.46010 355.67555 375.8910     3
         f_GKi_1(lst) 324.12827 324.66580 326.33016 325.20332 327.43111 329.6589     3
         f_GKi_2(lst) 315.73533 316.05770 333.35910 316.38007 342.17099 367.9619     3
               f(lst)  12.42986  14.08256  15.74231  15.73526  17.39853  19.0618     3

chinsoon12
fonte

Sim, funciona bem neste momento

ThomasIsCoding 11/12/19

Alguma maneira mais rápida de verificar se as listas em uma lista são equivalentes?

Respostas:

Entre no Teorema Fundamental da Aritmética