Buried Deep in Data? New Search Options Can Help

Private Investigator Data Mining and Information Research Techniques

It could have happened to any insurance company at any time. Joe, not his real name, claimed he was disabled after a one-car accident. No one else was involved.

Who’s to say he was wrong? Joe had the paperwork. His story sounded real. But that’s when Michele Stuart got involved.

She’s been a private investigator for the last 18 years and now owns JAG Investigations, in Gilbert, AZ. She also gives lectures all over the country on digital information.

“The Internet provides a… personal insight into people’s lives,” she said.

Several keystrokes proved it in Joe’s case… Michele found his “My Space” page with pictures he took the day after his “accident.” They showed him in front of the wrecked car. He was smiling. Other pictures showed him playing baseball after the accident.

“People post thousands of pictures on their ‘My Space’ page,” Michele explains. “They don’t realize it was searchable.”

Start with the Wide Angle

The web is expanding in content and complexity. According to Michele, the Internet currently has over 800 billion indexable Web pages but only a fraction can be searched.

Part of the problem is form-related. Not everything on the web is a web page. More information is showing up in “semi-structured” formats or tabular data like databases.

According to Suresh Chandrasekaran, a standard web search can’t handle semi-structured data well.

Suresh is senior vice-president at Denodo Technologies in Palo Alto, CA. They help companies with “Mashup” technology that summarizes unstructured data into a table.

Most people call a standard search like Google “successful,” he said, if they find what they need in one of the top three results.

Even standard search engines will return different results.

“Never run it on one,” Michele tells investigators when they type in their search criteria. “There’s hundreds of search engines.”
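Michele's advice to compare engines can be pictured with a short sketch. It assumes you have already collected result links from two or more engines by hand or by download; the engine names and URLs below are hypothetical placeholders, and the functions `normalize` and `merge_results` are illustrative names, not part of any search service.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Reduce a URL to host plus path so near-duplicates compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

def merge_results(results_by_engine):
    """Map each normalized URL to the list of engines that returned it."""
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            engines = merged.setdefault(normalize(url), [])
            if engine not in engines:
                engines.append(engine)
    return merged

# Hypothetical result lists; a real run would use links gathered from actual searches.
sample = {
    "engine_a": ["http://www.example.com/profile/", "http://example.org/news"],
    "engine_b": ["https://example.com/profile", "http://example.net/blog"],
}

for url, engines in merge_results(sample).items():
    print(url, "found by", ", ".join(engines))
```

Links that only one engine returned are the ones a single-engine search would have missed.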

Investigative work can be challenging on many levels. But the complexity of information is behind much of it. That complicates everything from simple paperwork to advanced searching on the web.

“The biggest ‘wish I had known’ was not actually in doing PI work but in the operation of the PI business,” said Kelly E. Riddle, TCI, TPLI from Texas. “I got a call last night from a retired DPS officer who is starting a PI business with a retired sheriff’s deputy and they were wanting to know about client contracts, forms to use, sales tax, etc. These are just the basics and most don’t have a clue…”

Other investigators see the same trend.

“I would say the gathering of data in general was something I had to learn on my own,” said Peter Psarouthakis with EWI & Associates from Chelsea, MI. “I get a lot of inquiries from just retired law enforcement people that want to know how they get certain types of information.”

“Most don’t realize that access to information is completely different in the private sector and there are costs associated with it.”

The old approach with printed materials was linear. You first gathered the data and then made sense of it. But the target is shifting in the digital world. Google now boasts about the ability to process one trillion web pages.

It can be difficult to keep up, much less make sense of that data or use it for information retrieval. But the same technology behind web complexity now offers an opportunity for better speed and efficiency.

Group Related Ideas

Clusty.com is an example. It’s a search engine that takes your entry and divides the data into categories of meaning called “clusters.” That helps to narrow a search according to meaning instead of working your way through a list of topics that appear according to popularity.

They even built in a feature called “find in clusters” where you can search for connections within your results.

I tried their service but noticed a longer wait time than with Google. It’s nice to see the information divided into categories off to the side in case you want to look in a different direction.
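The idea of grouping results into clusters can be sketched in a few lines. This is not how Clusty works internally, which the article does not describe; it is a simplified illustration that labels each result with the word it shares with the most other results. The titles, the stopword list, and the function names are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Small illustrative stopword list; a real tool would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "for", "to", "on", "at", "is"}

def significant_words(text, query):
    """Return the words of `text` that are neither stopwords nor query terms."""
    words = re.findall(r"[a-z]+", text.lower())
    skip = STOPWORDS | set(query.lower().split())
    return [w for w in words if w not in skip]

def cluster_results(titles, query):
    """Group titles under the word each one shares with the most other titles."""
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(significant_words(title, query)))

    clusters = defaultdict(list)
    for title in titles:
        words = set(significant_words(title, query))
        # Most widely shared word wins; ties are broken alphabetically.
        label = min(words, key=lambda w: (-doc_freq[w], w)) if words else "other"
        if doc_freq[label] < 2:
            label = "other"  # no word in common with any other title
        clusters[label].append(title)
    return dict(clusters)

# Hypothetical result titles for the query "jaguar".
titles = [
    "Jaguar car dealers in Arizona",
    "Jaguar habitat in the rainforest",
    "Used Jaguar car prices",
    "Rainforest jaguar conservation",
    "Jaguar football schedule",
]

for label, group in cluster_results(titles, "jaguar").items():
    print(label, "->", group)
```

Even this crude version separates the car results from the wildlife results, which is the benefit the category sidebar provides.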

Google and Exalead.com both offer proximity searching. For Google, type in a name followed by an asterisk (NEAR in capital letters for Exalead) and another term, and you get a list of documents in which those terms are separated by up to 15 words.

This is helpful when there’s a relationship between a person and problem. I tried that combination with my own name followed by an asterisk and the word “writer.” Several pages returned with links to my articles or books. Most of my work was in that list but not all of it.

Always use numerous sites, not just search engines, to develop a complete source of information. For instance, use a search engine for a broad profile, then use sites such as technorati.com, icerocket.com or myspace.com for more specific searches, advises Michele. You’ll get different results. It even helps to break the search terms up in the directories.

Each site has its own method to “index” content. Exalead.com has a feature like Google’s proximity search. Click on “near” at the top of its tool bar and it will return entries in which the terms are within 16 words of each other.

That could help when searching for a suspect when there’s a link to “extortion” or “tax fraud.”

Their “next” feature will find terms that are side by side.
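For readers who keep their own document files, the “near” and “next” ideas can be approximated locally. The sketch below is an assumption-laden illustration, not Exalead's or Google's implementation: it handles single-word terms only, treats the 16-word window as a default you can change, and uses made-up sample text. The function names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within_distance(text, first, second, max_words=16):
    """True if the two single-word terms occur within `max_words` words of each other."""
    words = tokenize(text)
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_words
               for i in first_positions
               for j in second_positions)

def side_by_side(text, first, second):
    """True if `second` appears immediately after `first`."""
    words = tokenize(text)
    return any(a == first.lower() and b == second.lower()
               for a, b in zip(words, words[1:]))

# Made-up sample sentence for illustration.
sample = "Joe Smith was questioned about alleged tax fraud last year."

print(within_distance(sample, "smith", "fraud"))  # terms are 6 words apart
print(side_by_side(sample, "tax", "fraud"))       # adjacent terms
```

Tightening `max_words` narrows the match, much as switching from “near” to “next” does.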

Zoominfo.com is another search engine that specializes in business-related searches that total “42 million people and 3.9 million companies.”

I tried their free “people search” with my own name and only got several returns related to teaching jobs and several articles.

Yoname.com and Icerocket.com will search blogs. That includes myspace.com. This can be helpful in targeting personal information in blog format.

The search engine lococitato.com extends that in a service called “My Space Visualizer” that will map relationships in a diagram. That might be helpful to see the people connected to your search target. Loco Citato is Latin for “In the place previously cited.”

There are search engines devoted to nonprofits, like Guidestar.org. That can make it easier if your target was connected to those organizations.

‘Google’ up Some Competition

Cuil.com recently came out as a major search engine. Anna Patterson, a former architect for Google’s search index TeraGoogle, helped launch the startup. The word “Cuil” is pronounced “COOL” and comes from a Gaelic word for knowledge.

They claim to have crawled 186 billion web pages with 120 billion in their index. Their approach puts the focus on the relationships among your search terms. They don’t just rank sites on the order of appearance.

The first thing you notice is the format. It combines graphics in the search results.

“The magazine style content takes some getting used to but in my experience the ‘search quality’ is as good if not better than other search engines,” reported Wayne E. Halick, a Licensed Private Detective in Illinois and Agency Director of Millennium Investigations, Inc.

He rarely looks beyond the first few pages of Google before revising his search but likes the auto-complete feature. “Google offers this on their toolbar search and it is great especially if you can’t remember the exact name of the movie, person, or city, you are looking for,” he added.

“They’re going for volume,” said Cynthia Hetherington about Cuil. “The data’s there.”

Cynthia is the author of Business Background Investigations: Tools and Techniques for Solution Driven Due Diligence and often lectures on digital information retrieval.

She doesn’t like the lack of advanced searching tools in Cuil, though.

“What good is overwhelming the user with every possible result, when you do not give them the tools to actually cull through the inordinate amount of returned results?” she asks.

There’s a reason behind the increase in results. According to Michele, Cuil will reference search engines and directories, not just individual web pages.

“It searches out a larger volume than Google,” she explained.

Cuil includes a feature called “Explore by Category.” This is like Clusty.com in the grouping of results according to category.

Still, Cuil is new and growing. They’re bound to make adjustments as they continue.

“Mash-up” for More Time

Dennis Drellishak Sr. had a problem recently. He’s president of Corporate Screening Services, Inc., in Middleburg Heights, Ohio. They do employment screening and background checks. That can get labor intensive with individuals having to sift through large amounts of data.

That’s when Dennis heard about mash-up technology. It’s a service that allows you to extract data from web sources and then reassemble it in a database along with other internal data. All this is automated in a program that runs 24/7.

“We can program the software to login to hundreds of website (Courts or proprietary sites) and run names,” reported Dennis.

He compares “one employee” searching 100 county courts or public records to the system that “can do it in a matter of minutes.”

“Once this system is fully integrated into our processes we will take initially about 10K a month off our overhead,” Dennis added. “This cost reduction will be a result of eliminating others services that charge us to search various courts and proprietary systems as well as labor cost.”

Suresh Chandrasekaran works for Denodo Technologies, the mash-up maker that Dennis is using. He said the program can take data from any format like databases, XML or free text and convert it in a way that allows the user to see relationships.

“You’re not cutting and pasting the name of a person and then putting it in a database,” Suresh added. “A lot of the heavy lifting of data extraction and consolidation is done automatically.”

That can be welcome news when you need information from hundreds of sources. Mash-ups allow you to automate the manual tasks and find patterns.
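The consolidation step Suresh and Dennis describe can be pictured with a toy example. This is not Denodo's software; it is a minimal sketch that assumes three sources have already been downloaded as CSV, JSON, and XML text. The sample records, field names, and function names are all hypothetical, and a real system would also handle logins, scheduling, and messy data.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for three differently formatted sources.
COURT_CSV = "name,county,case_type\nJohn Doe,Cuyahoga,civil\nJane Roe,Lorain,traffic\n"
LICENSE_JSON = '[{"full_name": "John Doe", "state": "OH", "status": "active"}]'
EXCLUSIONS_XML = (
    "<records><record><person>Jane Roe</person>"
    "<agency>GSA</agency></record></records>"
)

def from_csv(text):
    """Yield uniform rows from the CSV court extract."""
    for row in csv.DictReader(io.StringIO(text)):
        detail = f'{row["case_type"]} case, {row["county"]} County'
        yield {"name": row["name"], "source": "court", "detail": detail}

def from_json(text):
    """Yield uniform rows from the JSON license extract."""
    for item in json.loads(text):
        detail = f'{item["status"]} in {item["state"]}'
        yield {"name": item["full_name"], "source": "license", "detail": detail}

def from_xml(text):
    """Yield uniform rows from the XML exclusions extract."""
    for rec in ET.fromstring(text).iter("record"):
        yield {"name": rec.findtext("person"), "source": "exclusions",
               "detail": rec.findtext("agency")}

def mash_up(name):
    """Combine all sources into one table and return the rows matching `name`."""
    rows = [*from_csv(COURT_CSV), *from_json(LICENSE_JSON), *from_xml(EXCLUSIONS_XML)]
    return [r for r in rows if r["name"].lower() == name.lower()]

for row in mash_up("john doe"):
    print(row)
```

Once every source is reduced to the same three columns, one name search covers all of them, which is the time savings Dennis is describing.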

Dennis tied most state sex offender sites together into one search. In the past his staff had to do this manually or pay for it. They also automated the manual searching of OIG/GSA sites and tied 100 or so regional county and local court systems into one search in Ohio.

“Basically anything you consistently search on the Internet can be automated and compiled into one search that will pull back results into a format you want,” he said. “The possibilities for companies in our field to create new products and take control of and sort (Mash) large quantities of information are endless.”

“The web is a huge repository of info,” Suresh explained. The problem is how to get what you want and then relate it to things that you care about.

It makes you wonder about the future. The web continues to grow at an exponential rate in ways that can be difficult to get your mind around. But then the same technology that presents the problem also inspires innovative solutions.

That makes it easy to justify those new systems when you’re buried deep in data.

Clay Renick is a freelance feature writer from Statesboro, GA and has written many articles related to private investigation.