Thursday, January 20, 2022

Data, data everywhere..

One of the things my company's software does, and one of the core pieces of technology I worked on myself in the early days of the firm (when it was just me!), is a processing pipeline that unpicks news stories. When I say "unpicks" I mean that the software tries to identify companies, people, places and useful business topics within the text of each story; in other words, it tries to "make sense" of the story for a particular audience. This is a lot harder than it sounds: there are so many different ways the same thing can be talked about in the English language, and company and people names are hugely ambiguous and can be abbreviated or implied. There is a whole branch of computer science dedicated to this task called "Natural Language Processing", and it's interesting stuff!

Anyway, I've always been impressed at the amount of data we process. It's a 24x7 operation with roughly two million new stories coming down the wire from around the world every day, and we store roughly ten years' worth of stories so that when a new topic comes along we can go back through all that history and see which stories may be talking about it (essentially re-processing the entire corpus).
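To give a flavour of why this is hard, here's a minimal sketch of the naive approach: a gazetteer lookup mapping known aliases to canonical company names. All the names and the sample headline below are invented for illustration, and a real pipeline goes far beyond this, because plain string matching can't cope with the ambiguity described above (is "Apple" the company or the fruit?).

```python
# Toy gazetteer: alias -> canonical company name.
# Nicknames and abbreviations are one big source of ambiguity.
ALIASES = {
    "International Business Machines": "IBM",
    "IBM": "IBM",
    "Big Blue": "IBM",      # nickname
    "Apple": "Apple Inc.",  # also an everyday word, so naive matching misfires
}

def tag_companies(text):
    """Return canonical names of known companies mentioned in the text.

    Naive substring matching: no tokenisation, no disambiguation,
    just to illustrate the basic idea of entity lookup.
    """
    found = set()
    for alias, canonical in ALIASES.items():
        if alias in text:
            found.add(canonical)
    return sorted(found)

print(tag_companies("Big Blue beats earnings forecasts as Apple slips"))
```

Even this toy version shows the problem: it would happily tag "apple pie" headlines if we matched case-insensitively, and it misses any alias not already in the dictionary.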

The numbers are scary: trillions of words, billions of articles and millions of companies being searched for every day by a rack full of machines all whirring away in the dark. It's impressive, that is, until I look at the volumes of data they have to deal with in the "hard sciences", like physics.


Just imaging a black hole, for example, takes 5 petabytes of data. That's mind-boggling: our entire corpus (ten years' worth of news) is only 3 terabytes, or roughly three trillion characters, yet the black hole data is over a thousand times bigger than this. Now that's what I call BIG data..
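The ratio is easy to sanity-check with a bit of back-of-envelope arithmetic (treating a terabyte as 10^12 bytes and a petabyte as 10^15, and assuming roughly one byte per character):

```python
# Back-of-envelope comparison of the two dataset sizes mentioned above.
news_corpus_bytes = 3 * 10**12   # ~3 terabytes of news, ~1 byte per character
black_hole_bytes = 5 * 10**15    # ~5 petabytes of black hole imaging data

ratio = black_hole_bytes / news_corpus_bytes
print(f"The black hole dataset is about {ratio:.0f}x bigger")
```

So "over a thousand times bigger" is, if anything, an understatement: it works out to roughly 1,700 times.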

