Tokenization is the process of splitting a text into words, sentences, paragraphs, or other symbols that carry meaning.
The fundamental unit of text tokenization is called a token.
Tokens (by sentence):
Tokens (by word):
Tokens (by character):
Tokens (by word bigrams):
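The four granularities above can be sketched with plain Python string operations (the splits are deliberately naive, for illustration only; a real tokenizer handles punctuation and abbreviations):

```python
text = "La unidad fundamental de la tokenización se llama token. Cada token tiene significado."

# By sentence (naive split on the period):
sentences = [s.strip() for s in text.split(".") if s.strip()]

# By word (naive whitespace split):
words = text.split()

# By character:
chars = list(text)

# By word bigrams (pairs of consecutive words):
bigrams = list(zip(words, words[1:]))

print(words[:3])   # ['La', 'unidad', 'fundamental']
print(bigrams[0])  # ('La', 'unidad')
```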
BERT:
Baroni, Marco, Georgiana Dinu, and Germán Kruszewski (2014). "Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors." Proceedings of ACL 2014.
Distributional hypothesis:
Words that occur and are used in the same contexts are semantically similar to one another and have similar meanings.
Source: Hypotheses
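This hypothesis is what both count-based and prediction-based models (Baroni et al., 2014) exploit. A minimal count-based illustration, using a toy two-sentence corpus (the words and the whole-sentence window are illustrative choices): words that share contexts end up with similar co-occurrence counts.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: "gato" and "perro" appear in the same contexts.
corpus = [
    ["el", "gato", "come", "pescado"],
    ["el", "perro", "come", "carne"],
]

# Count symmetric word-word co-occurrences within each sentence.
cooc = Counter()
for sentence in corpus:
    for w1, w2 in combinations(sentence, 2):  # whole sentence as the window
        cooc[(w1, w2)] += 1
        cooc[(w2, w1)] += 1

# Both animals co-occur once with "come", so their count vectors look alike:
print(cooc[("gato", "come")], cooc[("perro", "come")])  # 1 1
```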
Mathematically speaking,

$$\cos(\theta)=\frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\,\|\vec{b}\|}$$

CBOW: the goal is to predict the current target word (the center word) from the source context words (the surrounding words).
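The cosine-similarity formula above can be computed directly with NumPy (the two vectors here are illustrative stand-ins, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # parallel to a, so similarity ≈ 1.0
print(cosine_similarity(a, b))
```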
"Sometimes it is the people no one imagines anything of who do the things no one can imagine."
– Alan Turing
Considering (context window, target word) pairs, we can build the following associations (context window of size 2):
"A veces son las mismísimas personas de las que nadie puede imaginarse algo, aquellas que hacen las cosas que nadie se puede imaginar". (Spanish original of the quote; the pairs below are built from its words.)
([A, son], veces)
([veces, las], son)
...
([puede, algo], imaginarse)
([aquellas, hacen], que)
...
And so on.
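Constructing the CBOW pairs above can be sketched in a few lines of Python, taking one word on each side of the target (a context window of total size 2, matching the pairs shown):

```python
# Build (context, target) pairs from the example sentence.
tokens = ("A veces son las mismísimas personas de las que nadie "
          "puede imaginarse algo aquellas que hacen las cosas "
          "que nadie se puede imaginar").split()

window = 1  # words taken on each side of the target
pairs = []
for i, target in enumerate(tokens):
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    pairs.append((context, target))

print(pairs[1])  # (['A', 'son'], 'veces')
print(pairs[2])  # (['veces', 'las'], 'son')
```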
Source: Álvaro Montenegro
Network architecture for the CBOW model
Source: Dipanjan (DJ) Sarkar
Skip-gram: the goal is to predict the context words (the surrounding words) from a given target word (the center word).
"A veces son las mismísimas personas de las que nadie puede imaginarse algo, aquellas que hacen las cosas que nadie se puede imaginar".
Example:
(veces, [A, son])
(son, [veces, las])
...
(imaginarse, [puede, algo])
(que, [aquellas, hacen])
...
And so on.
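The Skip-gram pairs above flip the CBOW direction: each center word predicts each of its context words separately, so one position yields one (center, context word) pair per neighbor. A sketch with the same one-word-per-side window:

```python
# Emit one (center, context_word) training pair per neighbor.
tokens = ("A veces son las mismísimas personas de las que nadie "
          "puede imaginarse algo aquellas que hacen las cosas "
          "que nadie se puede imaginar").split()

window = 1  # words taken on each side of the center word
pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs[:4])  # [('A', 'veces'), ('veces', 'A'), ('veces', 'son'), ('son', 'veces')]
```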
Network architecture for the Skip-gram model
Source: Dipanjan (DJ) Sarkar
Source: Training Word2vec using gensim
Source: FastText
ft_model.get_nearest_neighbors('personas', k=20)
[(0.9132910370826721, 'personar'),
 (0.9068970084190369, 'persona'),
 (0.8679944276809692, 'personita'),
 (0.8662665486335754, 'personal'),
 (0.8557455539703369, 'spersona'),
 (0.8324379324913025, 'persna'),
 (0.8065141439437866, 'persar'),
 (0.7967663407325745, 'personalmente'),
 (0.7958793640136719, 'personalizado'),
 (0.7871550917625427, 'persoan'),
 (0.7720642685890198, 'querer'),
 (0.7576236128807068, 'personalida'),
 (0.7519052624702454, 'perosona'),
 (0.7363068461418152, 'quere'),
 (0.7361247539520264, 'pesona'),
 (0.7349685430526733, 'debrer'),
 (0.7303863763809204, 'prematuro'),
 (0.7287424206733704, 'dde'),
 (0.726275622844696, 'caundo'),
 (0.7247213125228882, 'razòn')]
print("Dimension of each word:", ft_model.dim, '\n')
print(ft_model.get_word_vector('persona'))
Dimension of each word: 300

[ 5.24799526e-02  9.15666297e-02  3.97484303e-02 -6.31197989e-02
  5.07091433e-02 -4.31895442e-02  7.48612210e-02  1.57487899e-01
  ... -7.58974850e-02 -5.31887449e-02 -1.04609132e-02]
(300-dimensional vector; middle components omitted)
Knowledge areas in Astrophysics, built from scientific articles
Source: Álvaro Montenegro
embedding
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.241418 | -0.232559 | -0.085683 | 0.299351 | 0.109705 | 0.331151 | -0.080337 | -0.197136 | -0.427341 | -0.573067 | ... | -0.225016 | -0.815326 | 0.043260 | 0.105106 | 0.041180 | 0.060379 | 0.018453 | -0.366959 | 0.137687 | 0.080184 |
1 | 0.255045 | -0.256232 | -0.044591 | 0.315665 | 0.108994 | 0.323116 | -0.106720 | -0.222000 | -0.435245 | -0.577258 | ... | -0.216122 | -0.807650 | 0.000162 | 0.149000 | 0.030914 | 0.069198 | -0.006865 | -0.365532 | 0.120305 | 0.078757 |
2 | 0.239668 | -0.254680 | -0.070638 | 0.303795 | 0.120087 | 0.324071 | -0.091249 | -0.192964 | -0.422978 | -0.589075 | ... | -0.225875 | -0.816348 | 0.015498 | 0.131911 | 0.036075 | 0.064224 | 0.017920 | -0.373993 | 0.126668 | 0.082288 |
3 | 0.233435 | -0.230949 | -0.081912 | 0.309704 | 0.110778 | 0.327595 | -0.112487 | -0.204670 | -0.420729 | -0.566931 | ... | -0.222017 | -0.822799 | 0.025876 | 0.108496 | 0.034860 | 0.070980 | 0.019631 | -0.360372 | 0.130671 | 0.072055 |
4 | 0.243229 | -0.243130 | -0.095961 | 0.292602 | 0.104921 | 0.330812 | -0.077533 | -0.209971 | -0.430416 | -0.587449 | ... | -0.232541 | -0.829176 | 0.029503 | 0.128189 | 0.018461 | 0.060638 | -0.011216 | -0.365519 | 0.143600 | 0.074325 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7457 | 0.244309 | -0.263245 | -0.090281 | 0.291332 | 0.107840 | 0.307374 | -0.031982 | -0.228151 | -0.417979 | -0.612873 | ... | -0.229120 | -0.793140 | 0.041920 | 0.162498 | 0.015015 | 0.035492 | -0.004617 | -0.390809 | 0.152601 | 0.095524 |
7458 | 0.244283 | -0.237519 | -0.030111 | 0.291326 | 0.116971 | 0.323677 | -0.053814 | -0.238688 | -0.426710 | -0.584561 | ... | -0.206920 | -0.785080 | 0.023295 | 0.126629 | 0.024351 | 0.071057 | -0.005939 | -0.393282 | 0.141215 | 0.068864 |
7459 | 0.248388 | -0.247243 | -0.062380 | 0.288513 | 0.108986 | 0.326835 | -0.022564 | -0.233088 | -0.427388 | -0.609371 | ... | -0.227900 | -0.821805 | 0.026147 | 0.168068 | 0.027383 | 0.020774 | 0.025495 | -0.389975 | 0.142002 | 0.083954 |
7460 | 0.291054 | -0.261208 | -0.042859 | 0.281592 | 0.112613 | 0.312341 | -0.043446 | -0.256389 | -0.448102 | -0.611017 | ... | -0.206487 | -0.823019 | -0.021799 | 0.135237 | 0.031224 | 0.012056 | 0.009856 | -0.370721 | 0.153655 | 0.102042 |
7461 | 0.238826 | -0.259780 | -0.079581 | 0.294706 | 0.105108 | 0.307628 | -0.024828 | -0.229253 | -0.416846 | -0.598813 | ... | -0.235458 | -0.807307 | -0.000835 | 0.172921 | 0.020978 | 0.023366 | -0.007098 | -0.396306 | 0.137284 | 0.086243 |
7462 rows × 300 columns
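A table like the one above can be produced by stacking one 300-dimensional embedding per document into a pandas DataFrame. A sketch, with random vectors standing in for the real document embeddings (the size 5 is illustrative; the real table has 7462 rows):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_docs, dim = 5, 300                       # real table: 7462 x 300
vectors = rng.normal(size=(n_docs, dim))   # stand-in for real embeddings

# One row per document, one column per embedding dimension (0..299).
embedding = pd.DataFrame(vectors, columns=range(dim))
print(embedding.shape)  # (5, 300)
```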
Source: Daniel Montenegro
Source: Altexsoft