MusaddiqueHussainLabs NLP: State-of-the-Art Natural Language Processing & LLMs Library
MusaddiqueHussainLabs is a comprehensive Natural Language Processing (NLP) library designed to offer state-of-the-art functionality for various NLP tasks. This Python package provides a range of tools and functionalities aimed at facilitating NLP tasks, document analysis, and text preprocessing.
Features
Currently the package is organized into three primary modules:
1. NLP Components
| Component Type | Description |
|---|---|
| tokenize | Text tokenization |
| pos | Part-of-Speech tagging |
| lemma | Word lemmatization |
| morphology | Study of word forms |
| dep | Dependency parsing |
| ner | Named Entity Recognition |
| norm | Text normalization |
2. Text Preprocessing
This module equips users with an extensive set of text preprocessing tools:
| Function | Description |
|---|---|
| to_lower | Convert text to lowercase |
| to_upper | Convert text to uppercase |
| remove_number | Remove numerical characters |
| remove_itemized_bullet_and_numbering | Eliminate itemized/bullet-point numbering |
| remove_url | Remove URLs from text |
| remove_punctuation | Remove punctuation marks |
| remove_special_character | Remove special characters |
| keep_alpha_numeric | Keep only alphanumeric characters |
| remove_whitespace | Remove excess whitespace |
| normalize_unicode | Normalize Unicode characters |
| remove_stopword | Eliminate common stopwords |
| remove_freqwords | Remove frequently occurring words |
| remove_rarewords | Remove rare words |
| remove_email | Remove email addresses |
| remove_phone_number | Remove phone numbers |
| remove_ssn | Remove Social Security Numbers (SSN) |
| remove_credit_card_number | Remove credit card numbers |
| remove_emoji | Remove emojis |
| remove_emoticons | Remove emoticons |
| convert_emoticons_to_words | Convert emoticons to words |
| convert_emojis_to_words | Convert emojis to words |
| remove_html | Remove HTML tags |
| chat_words_conversion | Convert chat language to standard English |
| expand_contraction | Expand contractions (e.g., "can't" to "cannot") |
| tokenize_word | Tokenize words |
| tokenize_sentence | Tokenize sentences |
| stem_word | Stem words |
| lemmatize_word | Lemmatize words |
| preprocess_text | Combine multiple preprocessing steps into one function |
3. Document Analysis
| Functionality | Description |
|---|---|
| Language | Detect document language |
| Linguistic Analysis | Resolve ambiguities |
| Key phrases | Retrieve relevant information from documents |
| NER | Named Entity Recognition |
| Sentiment | Analyze sentiment of text |
| PII Anonymization | Anonymize Personally Identifiable Information |
Prerequisites
- Python >= 3.9
- GOOGLE_API_KEY from Google AI Studio
- Place the API key in a
.envfile in the project root directory.
Installation
To install musaddiquehussainlabs, you can use pip:
pip install musaddiquehussainlabs
Usage
from musaddiquehussainlabs.nlp_components import nlp
from musaddiquehussainlabs.text_preprocessing import preprocess_text, preprocess_operations
from musaddiquehussainlabs.document_analysis import DocumentAnalysis
data_to_process = "The employee's SSN is 859-98-0987. The employee's phone number is 555-555-5555."
# Using NLP component
result = nlp.predict(component_type="ner", input_text=data_to_process)
print(result)
# Text preprocessing
preprocessed_text = preprocess_text(data_to_process)
print(preprocessed_text)
# Custom Text preprocessing
preprocess_functions = [preprocess_operations.to_lower]
preprocessed_text = preprocess_text(data_to_process, preprocess_functions)
print(preprocessed_text)
# Document analysis
document_analysis = DocumentAnalysis()
# Option 1: full analysis
result = document_analysis.full_analysis(data_to_process)
# Option 2: Individual document analysis
result = document_analysis.pii_anonymization(data_to_process)
print(result)
Feel free to explore more functionalities and customize the usage based on your requirements!