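TF-IDF (term frequency-inverse document frequency) is a composite weight for each word in each document. The weight grows with how often the word occurs in the document and shrinks with how many documents in the collection contain the word.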

This workflow example calculates TF-IDF and gets the top-k important keywords in each document.

Together, Treasure Workflow and the TF-IDF weighting technique enable analysis of a massive collection of documents in the cloud. After you find the TF-IDF scores and top-k words for each document, you can use the results in a wide variety of applications, such as document clustering and recommendations.
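To make the weight concrete, here is a minimal Python sketch of a TF-IDF calculation on a toy corpus. It is not the workflow's own query. The formula (relative term frequency multiplied by `log(N / df) + 1`) is one common variant and is an assumption, since this page does not state which variant the workflow uses. The function name `tf_idf` and the sample tokens are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {docid: {word: weight}} for a {docid: [tokens]} mapping.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / document frequency) + 1.
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each word.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    weights = {}
    for docid, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[docid] = {
            word: (count / total) * (math.log(n_docs / df[word]) + 1.0)
            for word, count in counts.items()
        }
    return weights


# Toy corpus (made-up tokens, already lowercased and split).
docs = {
    1: ["justice", "law", "justice", "ethics"],
    2: ["wisdom", "knowledge", "ethics"],
}
weights = tf_idf(docs)
```

In this toy corpus, "justice" occurs twice in document 1 and in no other document, so it receives a higher weight there than "ethics", which occurs once and appears in both documents.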

Input

This workflow takes a table of the following form:

docid (long)    contents (string)
1               Justice, in its broadest context,…
2               Wisdom (sophia) is the ability to think …

Workflow

You can use a prepared basic workflow to calculate TF-IDF and get the top-k important keywords in each document.

You must execute stopwords.sh to create the stopwords table. If you use your own stopwords table, rewrite the table name in config/params.yml.

Stop words are common words such as "the", "is", "at", and "which". These words are filtered out before or after processing of natural language data.
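As an illustration of stop-word filtering, the following Python sketch drops stop words from a tokenized sentence. It assumes a tiny hard-coded stop-word set and a simple lowercase, letters-only tokenizer. The workflow itself reads its list from the stopwords table, and its tokenization may differ. The names `STOPWORDS`, `tokenize`, and `remove_stopwords` are hypothetical.

```python
import re

# Hypothetical stop-word set for illustration; the workflow uses a stopwords table.
STOPWORDS = {"the", "is", "at", "which"}


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stopwords]


tokens = remove_stopwords(tokenize("Wisdom is the ability to think"))
# tokens == ["wisdom", "ability", "to", "think"]
```

Because "to" is not in this small example set, it survives the filter. A real stop-word list would be much longer.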

  1. Prepare the sample data set:
    $ ./data.sh

  2. Prepare the stopwords table:
    $ ./stopwords.sh

  3. Push the workflow into Treasure Data:
    $ td wf push tfidf

  4. Run the workflow:
    $ td wf start tfidf tfidf --session now -p apikey=${YOUR_TD_API_KEY}

  • tfidf.dig – TD workflow script for TF-IDF calculation and getting top-k important keywords in each document.

  • config/params.yml – defines configurable parameters for the TF-IDF workflow, such as k of top-k (default: 3) and the language of the documents (english or japanese; default: english).

Sample Workflow Output

The outputs of the workflow are the following tables:

  • collected

  • top_k

Because the sample dataset contains only three documents and the query to get the top-k keywords in this workflow selects words that occur at least twice in documents, the first row of the top_k table has only two words.

The collected table contains a list of words and the TF-IDF for each document:

docid (long)    tfidf (array<string>)
1               ["justice:0.1758477689430746","based:0.07033910867777095",…]
2               ["action:0.08688948589543341","wisdom:0.06516711579725143",…]
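Each element of the tfidf array is a "word:score" string. When consuming the collected table downstream, the entries can be split into pairs. The following Python sketch assumes the rows have already been fetched into a list of strings. The helper name `parse_entry` is hypothetical, and the values are the two entries shown for docid 1.

```python
def parse_entry(entry):
    """Split a "word:score" string into a (word, float score) pair.

    rsplit on the last colon, so a word that itself contains ":" still parses.
    """
    word, score = entry.rsplit(":", 1)
    return word, float(score)


# The first two entries shown for docid 1 in the collected table.
row = ["justice:0.1758477689430746", "based:0.07033910867777095"]
scores = dict(parse_entry(entry) for entry in row)
```

After this runs, `scores` maps each word to its numeric weight, which is a convenient shape for sorting or for building feature vectors for clustering.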

The top_k table contains a list of top-k keywords for each document:

docid (long)    keywords (array<string>)
1               ["philosophy","study"]
2               ["experience","knowledge","understanding"]
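The following Python sketch shows how a minimum-occurrence filter can leave a document with fewer than k keywords. It assumes that "occur at least twice" means a word must appear in at least two documents. That is one plausible reading and is not confirmed by this page. The function name `top_k_keywords`, the `min_docs` parameter, and all scores below are made up for illustration and are not workflow output.

```python
from collections import Counter


def top_k_keywords(weights, k=3, min_docs=2):
    """Pick up to k highest-weighted words per document.

    Only words that appear in at least min_docs documents are eligible
    (an assumed interpretation of the workflow's filter).
    """
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for word_weights in weights.values():
        df.update(word_weights.keys())

    result = {}
    for docid, word_weights in weights.items():
        eligible = [(w, s) for w, s in word_weights.items() if df[w] >= min_docs]
        # Sort by descending score, then alphabetically to break ties.
        eligible.sort(key=lambda pair: (-pair[1], pair[0]))
        result[docid] = [word for word, _ in eligible[:k]]
    return result


# Made-up scores for three documents.
weights = {
    1: {"justice": 0.18, "philosophy": 0.05, "study": 0.04},
    2: {"wisdom": 0.07, "philosophy": 0.03, "study": 0.06, "knowledge": 0.05},
    3: {"knowledge": 0.02, "study": 0.01},
}
keywords = top_k_keywords(weights)
# keywords[1] == ["philosophy", "study"]
```

In this made-up data, "justice" has the highest score in document 1 but appears in no other document, so it is filtered out and only two keywords remain for that document.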



