It’s going to be a Wiki Christmas thanks to our partner BrightPlanet!
December 19, 2016
Was Santa involved in WikiLeaks too?
Our partner BrightPlanet specializes in harvesting and enriching large open-source datasets. They love big data, and WikiLeaks’ continuously expanding repository of leaked government documents is like an early Christmas gift! It’s also a great dataset to showcase their scalable harvesting technologies. Combining BrightPlanet’s capabilities with Rosette gives you the perfect tool to search, gather, and extract information from big data lakes. For our 25 days of Rosette, BrightPlanet’s team wrote a seasonally themed guest blog post showing how our two technologies work together:
In honor of the holiday season, we thought it would be fun to see if there were any interesting Christmas messages worthy of additional analytics buried within the 9.6+ million document WikiLeaks index.
What started as a lovely investigative journey into a political hot-button website turned into a horrifying slog through SPAM hell. If you’ve never dug into WikiLeaks emails before, the first thing you will notice is the tremendous amount of SPAM.
We estimate SPAM emails account for slightly less than 1 million documents, or roughly 10 percent of the entire WikiLeaks database. Many of these messages came from the mid-2000s, when SPAM filters were not nearly as robust. Using a combination of simple keyword filters, we narrowed the dataset down to 30,000 relevant documents that reference Christmas.
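For readers curious what a simple keyword filter of this kind might look like, here is a minimal Python sketch. It is not BrightPlanet’s actual pipeline: the term lists, the function names `is_relevant` and `filter_documents`, and the rule "mentions Christmas and contains no SPAM term" are all assumptions made for illustration.

```python
import re

# Hypothetical term lists; the post does not say which keywords were used.
CHRISTMAS_TERMS = ["christmas", "xmas", "santa"]
SPAM_TERMS = ["viagra", "lottery", "unsubscribe"]


def _compile(terms):
    """Build one case-insensitive, whole-word pattern from a list of terms."""
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


CHRISTMAS_RE = _compile(CHRISTMAS_TERMS)
SPAM_RE = _compile(SPAM_TERMS)


def is_relevant(text):
    """Keep a document if it mentions Christmas and has no SPAM term."""
    return bool(CHRISTMAS_RE.search(text)) and not SPAM_RE.search(text)


def filter_documents(docs):
    """Return only the documents that pass the keyword filters."""
    return [doc for doc in docs if is_relevant(doc)]


sample = [
    "Merry Christmas to the whole team!",
    "Win the Christmas lottery now, click to unsubscribe",
    "Quarterly budget review attached",
]
kept = filter_documents(sample)
```

On the three sample messages, only the first survives: the second mentions Christmas but also trips the SPAM terms, and the third never mentions the holiday. A real harvest at this scale would need a much longer SPAM list and probably more than keyword matching.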
Initially, we were hoping to locate a few Christmas wish lists, but unfortunately government contractors and bureaucrats don’t use their work emails to delegate Christmas shopping duties. There were plenty of “Merry Christmas” well-wishing emails: 1,356 emails to be exact. However, we did find someone who emailed themselves a shopping reminder with a list of gift ideas.
In a different subset, we found a Stratfor “Secret Santa” email chain, which generated around 300 emails. Many of the emails were asking for gift ideas for the drawn names. Our favorite gift idea was military replica gear for members of the Stratfor tactical team. There are some interesting inert weapons available from the suggested website. We agree! Stratfor’s tactical team would love anything from this website.
We came across this funny story from the Secretary of State’s daily briefing. We thought it would be fun to process this document through Basis Technology’s Rosette Text Analysis engine to extract the entity data. And there you have it: the “North Pole” definitely exists as a location.
For over 15 years, BrightPlanet has helped clients simplify the web collection and sense-making process. As a small team located in South Dakota, we love helping our customers and partners find ways to make sense of the world’s largest database, the internet. If you are interested in finding and harvesting large datasets on the web, you can schedule a consultation with one of our data acquisition engineers.